# DeepZig V3: A High-Performance LLM Architecture

## Overview
A DRAFT proposal & foundation for implementing DeepSeek V3 in Zig to create a high-performance, web-ready LLM inference engine. This leverages Zig's unique advantages for systems programming while targeting modern deployment scenarios.
⚠️ Status: EXPERIMENTAL DRAFT ✅ Foundation compiles with Zig 0.15.0-dev, including:
- ✅ HTTP server framework (basic structure)
- ✅ SIMD-optimized tensor operations (draft implementation)
- ✅ Cross-platform backend architecture
- ✅ Initial memory management
- ✅ Apple Silicon M-series detection (real hardware detection via sysctl)
- ✅ Comprehensive build system draft
- ⚠️ NOT PRODUCTION READY - Draft implementation for research/development
**Performance Note**: Current naive algorithms are ~1000x slower than optimized BLAS; matrix multiplication takes 6418ms for 1024×1024 (see the sketch below). This is expected for a foundational draft implementation. See the experimental benchmarks for detailed performance data.
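To make that gap concrete, the baseline is essentially the textbook triple loop. A minimal sketch of such a naive kernel (illustrative only, not the repository's actual code):

```zig
/// Naive O(n^3) matrix multiply over row-major n×n matrices.
/// No tiling, no SIMD, no threading: the inner loop strides through
/// `b` column-wise, missing cache on nearly every access, which is
/// why it lands ~1000x behind an optimized BLAS kernel.
fn matmulNaive(n: usize, a: []const f32, b: []const f32, out: []f32) void {
    for (0..n) |i| {
        for (0..n) |j| {
            var sum: f32 = 0.0;
            for (0..n) |k| {
                sum += a[i * n + k] * b[k * n + j];
            }
            out[i * n + j] = sum;
        }
    }
}
```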
## Why This Matters
Current LLM inference is dominated by Python/PyTorch, which introduces:
- Garbage collection pauses during generation
- Runtime overhead from dynamic dispatch
- Complex deployment with heavy runtimes
- Platform lock-in due to dependency complexity
## Expected Benefits vs Current Reality

| Aspect | Current (PyTorch) | Target (Zig) | Current Draft |
|---|---|---|---|
| Cold start | 10-30s | < 2s | Not measured |
| Memory usage | 20-40GB | < 16GB | 16GB+ for basic ops |
| Dependencies | ~2GB runtime | Single binary | ✅ Single binary |
| Deployment | Complex | Copy & run | ✅ Copy & run |
| Matrix mul (1024×1024) | ~1ms (optimized) | < 1ms | 6418ms (naive) |
See experimental benchmarks for current performance measurements.
## Why Zig?

**Performance**: Zero-cost abstractions, compile-time optimization, direct hardware access

**Simplicity**: Single static binary, no runtime dependencies, cross-compilation built-in

**Web-First**: Native HTTP server, WebAssembly compilation, efficient memory management
## Proposed Architecture

```
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Web Layer │ │ Core Engine │ │ Backends │
│ │ │ │ │ │
│ ├─ HTTP API │◄──►│ ├─ Transformer │◄──►│ ├─ CPU (SIMD) │
│ ├─ WebSocket │ │ ├─ Attention │ │ ├─ Metal (macOS)│
│ ├─ Rate Limit │ │ ├─ MoE Routing │ │ ├─ CUDA (Linux) │
│ └─ Auth │ │ └─ Tokenizer │ │ └─ WebGPU │
└─────────────────┘ └──────────────────┘ └─────────────────┘
```
## Draft Web API Framework

### Planned Endpoints (Basic Structure Implemented)
- `POST /v1/chat/completions` - OpenAI-compatible chat API
- `POST /v1/completions` - Text completion
- `GET /v1/models` - List available models
- `GET /health` - Service health check
- `WebSocket /ws` - Streaming inference (planned)
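For illustration, the chat endpoint's request body could be modeled as a plain struct and decoded with `std.json`. This is a hypothetical sketch (`ChatMessage`, `ChatCompletionRequest`, and `parseRequest` are invented names; the field names follow the public OpenAI wire format, not code from this repository):

```zig
const std = @import("std");

// Hypothetical request shape; field names mirror the public
// OpenAI chat-completions format, not this repository's code.
const ChatMessage = struct {
    role: []const u8,
    content: []const u8,
};

const ChatCompletionRequest = struct {
    model: []const u8,
    messages: []const ChatMessage,
    stream: bool = false,
};

fn parseRequest(
    allocator: std.mem.Allocator,
    body: []const u8,
) !std.json.Parsed(ChatCompletionRequest) {
    // Caller owns the result and must call .deinit() on it.
    return std.json.parseFromSlice(ChatCompletionRequest, allocator, body, .{
        .ignore_unknown_fields = true,
    });
}
```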
## Deployment Vision
- Static binaries - Single file deployment, no dependencies
- Direct VPS deployment - Copy binary and run with systemd
- Edge devices - ARM/RISC-V cross-compilation
- Serverless functions - Minimal cold start with static linking
- WebAssembly - Browser inference without additional runtime
## Implementation Plan Status

### Phase 1: Foundation ✅ DRAFT COMPLETE

- Set up Zig project structure
- Implement basic tensor operations with SIMD
- Create memory management system (arena allocators)
- Build HTTP server framework
- Apple Silicon detection via sysctl calls
- Updated to Zig 0.15.0-dev - compiles cleanly
- Benchmark suite showing current performance
📈 Performance baseline established - see benchmarks
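The baseline numbers come from simple wall-clock timing. A minimal sketch of that pattern using `std.time.Timer` (the actual suite lives under `experimental/` and may differ):

```zig
const std = @import("std");

pub fn main() !void {
    var timer = try std.time.Timer.start();
    // ... run the kernel under test here, e.g. a 1024×1024 matmul ...
    const elapsed_ns = timer.read();
    std.debug.print("elapsed: {d:.1} ms\n", .{
        @as(f64, @floatFromInt(elapsed_ns)) / std.time.ns_per_ms,
    });
}
```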
### Phase 2: Core Model (IN PROGRESS)

- Implement transformer layers
- Add Multi-Head Latent Attention (MLA)
- Build Mixture of Experts (MoE) routing
- Create tokenizer integration
### Phase 3: Backends (PLANNED)

- Optimize CPU backend with AVX/NEON
- Integrate Metal for Apple Silicon
- Add CUDA support for NVIDIA GPUs
- Implement WebGPU for browsers
### Phase 4: Web Integration (DRAFT STRUCTURE)

- Complete HTTP API implementation (basic structure)
- Add WebSocket streaming
- Build authentication/rate limiting
- Create deployment tooling
## Technical Challenges
- Model Complexity: DeepSeek V3's MoE architecture requires careful memory management
- Backend Integration: Need efficient FFI to CUDA/Metal while maintaining performance
- Web Scale: Handle concurrent requests without blocking inference
- Accuracy: Match PyTorch numerical precision
- Performance: Current implementation is ~1000x slower than optimized BLAS; major optimization needed
## Platform-Specific Opportunities

### Apple Silicon (M-Series) ✅ Draft Detection Implemented
- Metal Performance Shaders integration for matrix operations
- AMX instruction set access for accelerated linear algebra
- Unified memory architecture exploitation for zero-copy transfers
- Power efficiency tuning across P and E cores
- ✅ Proper M1/M2/M3/M4 detection via system calls
Current status: Hardware detection working, GPU acceleration not yet implemented.
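For reference, the detection pattern boils down to a `sysctlbyname` call through Zig's C interop. A simplified sketch (`readHwModel` is an invented name; link with `-lc`, and the actual implementation in `src/` may differ in detail):

```zig
const c = @cImport({
    @cInclude("sys/types.h");
    @cInclude("sys/sysctl.h");
});

/// Read `hw.model` (e.g. "MacBookPro17,1") into `buf`.
/// The CPU brand string ("Apple M1", ...) comes from the analogous
/// `machdep.cpu.brand_string` key.
fn readHwModel(buf: []u8) ?[]const u8 {
    var len: usize = buf.len;
    if (c.sysctlbyname("hw.model", buf.ptr, &len, null, 0) != 0) return null;
    if (len == 0) return null;
    return buf[0 .. len - 1]; // reported length includes the trailing NUL
}
```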
### x86_64 Architecture
- AVX-512 vectorization with masked operations
- Cache-friendly memory layouts for L1/L2/L3 optimization
- NUMA-aware allocation and thread assignment
- Dynamic dispatch based on runtime CPU feature detection
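As a sketch of the dispatch idea: Zig exposes the target's CPU feature set at compile time, and a fully portable binary would pair this with a runtime `cpuid` probe to pick a kernel at startup:

```zig
const std = @import("std");
const builtin = @import("builtin");

// Compile-time check against the build target's feature set. A portable
// binary would additionally probe cpuid at startup and select a kernel.
const has_avx512 = builtin.cpu.arch == .x86_64 and
    std.Target.x86.featureSetHas(builtin.cpu.features, .avx512f);

pub fn main() void {
    std.debug.print("AVX-512F in target features: {}\n", .{has_avx512});
}
```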
### NVIDIA GPUs
- CUDA integration via efficient FFI bindings
- Tensor Core utilization for mixed-precision operations
- Custom kernels for attention mechanisms
- Memory pooling for reduced allocation overhead
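The FFI surface can stay small: Zig declares driver-API entry points directly, with no binding generator. A hypothetical sketch against the public CUDA driver API (`cudaDeviceCount` is an invented wrapper; link with `-lcuda`):

```zig
// Signatures match the public CUDA driver API; everything beyond
// this point would be project-specific kernel and memory management.
extern fn cuInit(flags: c_uint) c_int;
extern fn cuDeviceGetCount(count: *c_int) c_int;

/// Returns the number of CUDA devices, or null if the driver
/// fails to initialize (CUDA_SUCCESS == 0).
pub fn cudaDeviceCount() ?c_int {
    if (cuInit(0) != 0) return null;
    var n: c_int = 0;
    if (cuDeviceGetCount(&n) != 0) return null;
    return n;
}
```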
## Getting Started

**Current Status**: This repository contains a DRAFT EXPERIMENTAL Zig implementation foundation.

**For the Current Zig Implementation:**
```bash
# Clone this repository
git clone https://github.com/[current-repo-path]
cd DeepSeek-V3-Zig/experimental

# Build and test the foundation
zig build

# Run the HTTP server (basic structure)
zig build run -- --port 8080

# Run benchmarks (see actual performance)
zig build bench

# Test Apple Silicon detection
zig build-exe src/test_m_series.zig -I src -lc -framework Metal -framework Foundation
./test_m_series
```
📊 **Performance Reality Check**: See `experimental/README.md` for actual benchmark results showing current performance limitations and optimization opportunities.
## Development Approach
Following established Zig patterns:
- Arena allocators for request-scoped memory
- Error unions for explicit error handling
- Comptime generics for zero-cost abstractions
- SIMD vectors for numerical computation
Reference: Zig Cookbook for implementation patterns.
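A compact sketch of the first and last patterns together, assuming nothing beyond the standard library:

```zig
const std = @import("std");

pub fn main() !void {
    // Arena allocator: every allocation tied to one request is
    // released in a single deinit, with no per-object bookkeeping.
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const alloc = arena.allocator();

    const values = try alloc.alloc(f32, 8);
    @memset(values, 1.5);

    // SIMD vectors: @Vector lowers to native vector registers
    // (NEON, AVX, ...) where the target supports them.
    const Vec = @Vector(8, f32);
    const v: Vec = values[0..8].*;
    const doubled = v + v;
    std.debug.print("{any}\n", .{doubled});
}
```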
## Seeking Contributors
This is an ambitious DRAFT project that would benefit from expertise in:
- Performance optimization (current bottleneck: naive matrix operations)
- Zig systems programming
- GPU kernel optimization (CUDA/Metal)
- ML model implementation
- Web server development
- Hardware-software co-design
- Novel inference techniques (speculative decoding, quantization)
## Current Limitations & Next Steps
🚧 What's Working: Compiles, runs, measures performance
⚠️ What's Missing: Optimized algorithms, robust flows, actual DeepSeek V3 model
📊 Performance Gap: 1000x slower than production systems
🎯 Next Priority: BLAS integration and GPU acceleration
See experimental implementation for technical details and current benchmarks.
## References
- DeepZig V3 (Experimental Implementation) - Current working code
- DeepSeek V3 Paper - Original model architecture
- Zig Language - Language documentation
- Awesome Zig - Community resources
- Zig Patterns - Common idioms
- ZML - Zig Inference Stack
- LLaMA.cpp - C++ Inference Engine
- DeepZig Consciousness - Research goal/end game
Status: 🎯 EXPERIMENTAL DRAFT - Foundation compiles and runs basic operations (see benchmarks)
Vision: Foundation for advanced AI reasoning research
⚠️ Important: This is a research/development foundation with draft/base implementations. Not ready for production use.