# DeepZig V3 Implementation 🚀

A high-performance implementation of DeepSeek V3 in [Zig](https://ziglang.org/) for blazingly fast inference.

> **⚠️ Status: Experimental Foundation**
>
> This project provides an **experimental foundation** for DeepZig V3 with a working draft implementation:
> - ✅ **HTTP server** with OpenAI-compatible API
> - ✅ **BLAS-accelerated tensor operations** (Apple Accelerate working)
> - ✅ **Cross-platform build system** (Zig 0.15.0-dev)
> - ✅ **Memory management** and backend architecture
> - ✅ **Apple Silicon detection and optimization**
> - ✅ **Functional matrix operations** (significant performance improvement)
>
> **Recent Progress**: Matrix operations now use BLAS acceleration.
> **Performance Status**: 1164 GFLOPS peak with the Apple Accelerate backend (measured on an Apple M1 MacBook Pro)
>
> See [Performance Notes](#performance-notes) for detailed benchmarks.

## Overview

This experimental implementation aims to leverage Zig's unique advantages for systems programming to create a high-performance LLM inference engine:

- **Zero-cost abstractions** with compile-time optimization
- **Direct hardware access** for SIMD and platform-specific optimizations
- **Manual memory management** without garbage collection pauses
- **Single binary deployment** with no runtime dependencies
- **Cross-platform compilation** for multiple architectures

**🚀 BLAS Acceleration Achieved!** We've successfully integrated the Apple Accelerate backend, delivering **1000+ GFLOPS**, roughly a **3000x speedup** over the initial naive implementation (measured on an M1 MacBook Pro).

**🔗 Related**: See the [main project README](../README.md) for architecture overview and vision.

## Project Structure

```
experimental/
├── build.zig              # Build system configuration
├── build.zig.zon          # Package dependencies
├── src/
│   ├── main.zig           # HTTP server entry point
│   ├── core/              # Core ML components
│   │   ├── root.zig       # Module exports
│   │   ├── tensor.zig     # SIMD-optimized tensors
│   │   ├── model.zig      # DeepSeek V3 model
│   │   ├── attention.zig  # MLA attention mechanism
│   │   ├── moe.zig        # Mixture of Experts
│   │   ├── tokenizer.zig  # Text tokenization
│   │   ├── backend.zig    # Backend abstraction
│   │   ├── memory.zig     # Memory management
│   │   └── math/          # Math utilities
│   │       ├── root.zig   # Math module exports
│   │       ├── simd.zig   # SIMD operations
│   │       ├── activation.zig # Activation functions
│   │       └── rms_norm.zig   # RMS normalization
│   ├── web/               # HTTP API layer
│   │   ├── root.zig       # Web module exports
│   │   ├── server.zig     # HTTP server (std.http)
│   │   ├── handlers.zig   # Request handlers
│   │   ├── middleware.zig # CORS, auth, rate limiting
│   │   ├── websocket.zig  # WebSocket support
│   │   ├── openai.zig     # OpenAI API compatibility
│   │   ├── request.zig    # Request wrapper
│   │   └── response.zig   # Response wrapper
│   ├── backends/          # Compute backends
│   │   ├── cpu/           # CPU with SIMD
│   │   ├── metal/         # Apple Silicon
│   │   └── cuda/          # NVIDIA GPUs
│   └── wasm/
│       └── main.zig       # WebAssembly entry point
├── bench/
│   └── main.zig           # Performance benchmarks
└── README.md              # This file
```

## Requirements

- **Zig 0.15.0-dev**
- Platform-specific requirements:
  - **macOS**: Xcode Command Line Tools (for Metal backend)
  - **Linux**: CUDA Toolkit (for CUDA backend, optional)
  - **Windows**: CUDA Toolkit (for CUDA backend, optional)

## Quick Start

### Building

```bash
# Clone and navigate to experimental directory
cd experimental/

# Build the project
zig build

# Run the server
zig build run

# Run tests
zig build test

# Run benchmarks
zig build bench

# Build WebAssembly
zig build wasm
```

### Running the Server

```bash
# Start server on default port (8080)
./zig-out/bin/deepseek-v3-zig

# Custom configuration
./zig-out/bin/deepseek-v3-zig --port 3000 --backend metal --model ./path/to/model
```

### API Usage

The server exposes OpenAI-compatible endpoints:

```bash
# Chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

# Health check
curl http://localhost:8080/health

# Model info
curl http://localhost:8080/v1/models
```

## Performance Features

### SIMD Optimizations

- **x86_64**: AVX2/AVX-512 vectorization for matrix operations
- **ARM64**: NEON SIMD for Apple Silicon optimization
- **Auto-vectorization**: Compiler-optimized loops with `@Vector` types (see the sketch below)
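To make the `@Vector` approach concrete, here is a minimal sketch of SIMD element-wise addition; `vecAdd` is a hypothetical helper for illustration, not code from `src/core/math/simd.zig`:

```zig
const std = @import("std");

/// Element-wise addition over f32 slices using Zig's portable `@Vector`.
/// The compiler lowers the vector ops to NEON on ARM64 or AVX on x86_64.
fn vecAdd(comptime lanes: usize, dst: []f32, a: []const f32, b: []const f32) void {
    const Vec = @Vector(lanes, f32);
    std.debug.assert(a.len == b.len and dst.len == a.len);
    var i: usize = 0;
    // Vectorized main loop: `lanes` elements per iteration.
    while (i + lanes <= a.len) : (i += lanes) {
        const va: Vec = a[i..][0..lanes].*;
        const vb: Vec = b[i..][0..lanes].*;
        dst[i..][0..lanes].* = va + vb;
    }
    // Scalar tail for lengths not divisible by `lanes`.
    while (i < a.len) : (i += 1) {
        dst[i] = a[i] + b[i];
    }
}

test "vecAdd sums element-wise" {
    const a = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    const b = [_]f32{ 9, 8, 7, 6, 5, 4, 3, 2, 1 };
    var out: [9]f32 = undefined;
    vecAdd(4, &out, &a, &b);
    for (out) |x| try std.testing.expectEqual(@as(f32, 10), x);
}
```

The same pattern (vector body plus scalar tail) generalizes to other element-wise tensor operations.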
### Backend Support

| Backend | Status | Features |
|---------|--------|----------|
| **CPU** | ✅ Implemented | Multi-threaded, SIMD, cache-optimized |
| **Metal** | 🚧 In Progress | Apple Silicon GPU, unified memory |
| **CUDA** | 🚧 Planned | NVIDIA GPU, Tensor Cores |
| **WebGPU** | 📋 Future | Browser GPU acceleration |

### Memory Management

- **Arena allocators** for request-scoped memory
- **Memory pools** for tensor allocations
- **Zero-copy operations** where possible
- **Cache-friendly** data layouts

## Development Status

### ✅ Drafted
- [x] Project structure and build system
- [x] Core tensor operations with SIMD
- [x] HTTP server with OpenAI API compatibility
- [x] CPU backend with optimizations
- [x] Memory management utilities
- [x] Benchmark suite

### 🚧 In Progress
- [ ] DeepSeek V3 model architecture
- [ ] Multi-Head Latent Attention (MLA)
- [ ] Mixture of Experts (MoE) implementation
- [ ] Metal backend for Apple Silicon
- [ ] Model loading and weight management

### 📋 Planned
- [ ] CUDA backend for NVIDIA GPUs
- [ ] WebSocket streaming
- [ ] Model quantization (INT8, FP16)
- [ ] Flash Attention optimization
- [ ] Distributed inference
- [ ] Advanced sampling strategies

## Architecture Decisions

### Why Zig?

1. **Performance**: Zero-cost abstractions without runtime overhead
2. **Memory Safety**: Compile-time memory management without GC
3. **Simplicity**: Single binary deployment, cross-compilation
4. **Control**: Direct hardware access for optimization

### Design Principles

- **Modularity**: Clean separation between core, web, and backend layers
- **Performance**: SIMD-first design with cache-friendly algorithms
- **Compatibility**: OpenAI API compatibility for easy adoption
- **Extensibility**: Plugin architecture for new backends

## Contributing

This is an experimental project! Contributions are welcome:

1. **Core ML**: Implement transformer layers, attention mechanisms
2. **Backends**: Optimize CUDA/Metal compute kernels
3. **Performance**: Profile and optimize bottlenecks
4. **Testing**: Add comprehensive test coverage
5. **Documentation**: Improve setup and usage guides

### Development Setup

```bash
# Install Zig 0.15.0-dev
# https://ziglang.org/download/

# Clone repository
git clone [repository-url]
cd experimental/

# Run tests during development
zig build test --watch

# Format code
zig fmt src/
```

## Benchmarks

Run benchmarks to measure performance:

```bash
zig build bench
```

**Hardware Context**: Benchmarks run on an Apple M1 MacBook Pro (MacBookPro17,1) with 16GB unified memory, Zig 0.15.0-dev.703+597dd328e, debug build.

Example output:

```
🚀 DeepZig V3 Performance Benchmarks
==========================================

🎯 DYNAMIC BENCHMARK SUMMARY
===============================

📊 Matrix Multiplication Performance:
  • 256×256: 0.0 ms, 937 GFLOPS
  • 512×512: 0.2 ms, 1084 GFLOPS
  • 1024×1024: 2.1 ms, 1164 GFLOPS
  • 2048×2048: 20.9 ms, 823 GFLOPS
  🏆 Peak measured: 1164 GFLOPS at 1024×1024

🧮 BLAS Configuration:
  • Backend: Apple Accelerate
  • Theoretical peak: 2600 GFLOPS (estimated)

➕ Tensor Operations:
  • SIMD Addition: 3.5 GB/s

💾 Memory Performance:
  • Copy Bandwidth: 20.9 GB/s
  • Random Access Latency: 1.8 ns

🎯 Performance Assessment:
  ✅ Acceptable: BLAS delivering 1000+ GFLOPS
  • Est. efficiency: 44% (vs theoretical peak)

Note: Benchmarked on Apple M1 MacBook Pro under heavy load
(should be significantly higher on a clean system).
```
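The GFLOPS figures above follow from the standard 2·N³ operation count for an N×N matrix multiply (one multiply and one add per inner-product term) divided by wall-clock time. A minimal sketch of that arithmetic (`matmulGflops` is a hypothetical helper, not the repo's bench code); it uses the rounded 2.1 ms from the table, so it lands a bit under the reported 1164 GFLOPS, which comes from unrounded timings:

```zig
const std = @import("std");

/// GFLOPS for an N×N matrix multiply: 2·N³ floating-point operations
/// (one multiply and one add per inner-product term) divided by wall time.
fn matmulGflops(n: u64, elapsed_ns: u64) f64 {
    const nf: f64 = @floatFromInt(n);
    const flops = 2.0 * nf * nf * nf;
    const seconds = @as(f64, @floatFromInt(elapsed_ns)) / 1e9;
    return flops / seconds / 1e9;
}

test "1024×1024 at the rounded 2.1 ms is ~1023 GFLOPS" {
    const gflops = matmulGflops(1024, 2_100_000);
    try std.testing.expectApproxEqRel(@as(f64, 1022.6), gflops, 0.01);
}
```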
Detailed results and system status are collected in [Performance Notes](#performance-notes) below.

## Known Issues

- **Model Loading**: Currently creates dummy models; real weight loading is not implemented
- **Tokenizer**: Placeholder implementation; needs a proper BPE tokenizer
- **WebSocket**: Basic structure only; streaming is not implemented
- **Metal/CUDA**: Backend stubs only; GPU kernels are not implemented

## License

This experimental implementation follows the same license as the original DeepSeek V3 project.

## Resources

- [Original DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437)
- [Zig Language Documentation](https://ziglang.org/documentation/master/)
- [Zig Performance Guide](https://github.com/ziglang/zig/wiki/Performance)
- [SIMD in Zig](https://ziglang.org/documentation/master/#Vectors)

## Is This Ready for Production?

**No**: this is a research/development foundation. But it compiles and runs:

- **What works now**: ✅ Compiles and runs with Zig 0.15.0-dev; HTTP server, BLAS-accelerated tensor operations, SIMD math, and benchmarks all execute successfully
- **What's missing**: Real weight loading, a proper tokenizer, and the actual DeepSeek V3 model implementation (MLA attention, MoE)
- **Timeline**: The foundation is in place; the model implementation is the next major milestone

## Comparison to Other Projects

| Project | Language | Status | Focus |
|---------|----------|--------|-------|
| **This** | Zig | Foundation + API | Web-first inference |
| llama.cpp | C++ | Production | CLI/library |
| Candle | Rust | Production | ML framework |
| ZML | Zig | Research | Low-level ML ops |

**Unique advantages**: Built-in web server, Zig's zero-cost abstractions, single binary deployment.

---

**⚡ Built with Zig for blazing fast LLM inference!**

## Performance Notes

**Current Status**: ✅ **BLAS integration working**. The Apple Accelerate backend is now functional in the draft implementation.

**Performance Results** (Apple M1 MacBook Pro under heavy load):

- **Matrix 256×256**: 0.0ms/iter, **937 GFLOPS**
- **Matrix 512×512**: 0.2ms/iter, **1084 GFLOPS**
- **Matrix 1024×1024**: 2.1ms/iter, **1164 GFLOPS** (peak performance)
- **Matrix 2048×2048**: 20.9ms/iter, **823 GFLOPS**

**Performance Achievement**: From **6418ms naive** → **2.1ms BLAS**, roughly a **3000x speedup** on matrix operations.

**System Status**:

- ✅ **BLAS Backend**: Apple Accelerate integration working
- ✅ **Peak Performance**: **1164 GFLOPS measured** (44% of estimated theoretical peak)
- ✅ **Memory Bandwidth**: 20.9 GB/s copy bandwidth, well-optimized operations
- ✅ **Hardware Detection**: M-series Apple Silicon detection functional

**Next Steps**: Focus on the transformer architecture, attention mechanisms, and model-specific optimizations for the draft DeepSeek V3 implementation.