🧠 MAJOR MILESTONE: Complete architectural implementation of Multi-Head Latent Attention, the key innovation that makes DeepSeek V3 more efficient than standard transformers. ✨ What's New: • Multi-Head Latent Attention (MLA) with latent space projections • Complete transformer architecture (RMS norm, SwiGLU, residual connections) • RoPE (Rotary Position Encoding) with pre-computed embeddings • KV Cache for efficient autoregressive inference • Full BLAS acceleration delivering 1000+ GFLOPS on Apple Silicon (Apple M1 Macbook Pro under heavy load - 250+ chrome tabs, 30+ vscode instances) 🏗️ Architecture Highlights: • Latent projections (kv_a_proj_with_mqa, kv_b_proj) for efficient KV computation • Separate handling of positional vs non-positional components • LayerNorm in latent space for training stability • BLAS-accelerated scaled dot-product attention • MoE integration architecture ready for expert routing ⚡ Performance: • 1164 GFLOPS peak performance (Apple M1 MacBook Pro) • ~3000x speedup over naive implementations via BLAS integration • First architectural implementation of MLA attention mechanism 🧪 Status: • Theoretical implementation following DeepSeek V3 paper specifications • Compiles cleanly with Zig 0.15.0-dev, passes all tests • Architecturally complete but requires validation with real model weights 🎯 Next Steps: • Load real DeepSeek V3 weights (safetensors/HuggingFace format) • Validate outputs against reference PyTorch implementation • Complete MoE expert routing and tokenization • End-to-end inference pipeline Updated -> dual LICENSE, added to headers for relevant files. This makes us the first project to architecturally implement DeepSeek V3's Multi-Head Latent Attention innovation in a systems programming language. |
||
---|---|---|
.. | ||
bench | ||
src | ||
build.zig | ||
build.zig.zon | ||
README.md | ||
SETUP.md |
DeepZig V3 Implementation 🚀
A high-performance implementation of DeepSeek V3 in Zig for blazingly fast inference.
✅ Status: MLA Attention Architecture Implemented
This project provides a theoretical foundation of DeepZig V3 with significant architectural progress:
- ✅ Multi-Head Latent Attention (MLA) - Core DeepSeek V3 innovation architecturally implemented
- ✅ Complete Transformer Architecture with layer normalization, SwiGLU, and MoE integration
- ✅ HTTP server with OpenAI-compatible API
- ✅ BLAS-accelerated tensor operations (Apple Accelerate working)
- ✅ Cross-platform build system (Zig 0.15.0-dev)
- ✅ Memory management and backend architecture
- ✅ Apple Silicon detection and optimization
- ✅ Functional matrix operations (significant performance improvement)
- ✅ RoPE (Rotary Position Encoding) for position-aware attention
- ✅ KV Cache for efficient inference
- ✅ RMS Layer Normalization following DeepSeek V3 specifications
Latest Achievement: Multi-Head Latent Attention mechanism architecturally complete with RoPE, KV caching, and BLAS acceleration
Performance Status: 1160+ GFLOPS with Apple Accelerate backend working (measured on Apple M1 Macbook)
Validation Status: ⚠️ Theoretical implementation - requires testing with real model weights and output validationSee Performance Results for detailed benchmarks.
Overview
This experimental implementation aims to leverage Zig's unique advantages for systems programming to create a high-performance LLM inference engine:
- Zero-cost abstractions with compile-time optimization
- Direct hardware access for SIMD and platform-specific optimizations
- Manual memory management without garbage collection pauses
- Single binary deployment with no runtime dependencies
- Cross-platform compilation for multiple architectures
🚀 BLAS Acceleration Achieved! We've successfully integrated Apple Accelerate backend delivering 1000+ GFLOPS performance - a 3000x speedup over the initial naive implementation. Measured on an M1 Macbook.
🧠 MLA Attention Architecturally Complete! The core innovation of DeepSeek V3 - Multi-Head Latent Attention - is now architecturally implemented with:
- Latent space projections for efficient key-value computation
- RoPE integration for positional encoding
- KV caching for fast inference
- BLAS-accelerated scaled dot-product attention
⚠️ Important: This is a theoretical implementation following the DeepSeek V3 paper specifications. It compiles, runs, and passes basic tests, but requires validation with real model weights and output verification against reference implementations.
🔗 Related: See the main project README for architecture overview and vision.
Key Technical Achievements
✅ Multi-Head Latent Attention (MLA) - Architecture Implemented
The cornerstone innovation of DeepSeek V3, now architecturally complete following paper specifications:
/// Multi-Head Latent Attention Configuration
pub const MLAConfig = struct {
hidden_size: u32,
num_attention_heads: u32,
num_key_value_heads: u32,
qk_nope_head_dim: u32, // Non-positional encoding dimension
qk_rope_head_dim: u32, // RoPE dimension
v_head_dim: u32, // Value head dimension
rope_base: f32, // RoPE base frequency
max_position_embeddings: u32,
attention_dropout: f32,
use_flash_attention: bool,
};
Architectural Features:
- Latent projections:
kv_a_proj_with_mqa
andkv_b_proj
for efficient KV computation - Separate nope/rope dimensions: Optimized handling of positional vs non-positional components
- LayerNorm in latent space: Stable training and inference
- BLAS acceleration: All matrix operations use optimized BLAS calls
⚠️ Validation Needed: While theoretically sound, requires testing with real DeepSeek V3 weights and output validation.
✅ Complete Transformer Architecture - Draft Implementation
pub const TransformerLayer = struct {
// Attention components
attention: attention.MultiHeadLatentAttention,
attention_norm: RMSNorm,
// Feed-forward components (MoE or dense)
mlp: ?SwiGLU, // Dense FFN for non-MoE layers
moe_layer: ?moe.MoE, // MoE layer (for MoE layers)
mlp_norm: RMSNorm,
};
Architecture Components:
- RMS Layer Normalization: Following DeepSeek V3 specifications
- SwiGLU Activation: Gate/Up/Down projections with SiLU activation
- MoE Integration: Automatic layer-wise expert routing (stub implementation)
- Residual Connections: Proper transformer residual flow
✅ Supporting Components
RoPE (Rotary Position Encoding) - Efficient implementation:
const RoPE = struct {
cos_cache: FloatTensor,
sin_cache: FloatTensor,
pub fn apply(self: *const Self, tensor_data: *FloatTensor, seq_len: u32, start_pos: u32) !void
KV Cache - Optimized for autoregressive generation:
const KVCache = struct {
k_cache: FloatTensor,
v_cache: FloatTensor,
pub fn update(self: *Self, new_k: *const FloatTensor, new_v: *const FloatTensor, start_pos: u32) !void
Development Status
✅ Architecturally Complete
- Multi-Head Latent Attention (MLA) - Core DeepSeek V3 innovation (theoretical implementation)
- Complete Transformer Layers with RMS norm, SwiGLU, residual connections
- RoPE (Rotary Position Encoding) with pre-computed embeddings
- KV Cache for efficient autoregressive inference
- BLAS Integration for all matrix operations
- Project structure and build system
- Core tensor operations with SIMD
- HTTP server with OpenAI API compatibility
- CPU backend with optimizations
- Memory management utilities
- Benchmark suite
- Comprehensive test coverage for attention and transformer components
🧪 Validation & Testing Required
- Real model weight loading (safetensors/HuggingFace format)
- Output validation against reference PyTorch implementation
- Numerical accuracy testing with known inputs/outputs
- End-to-end inference verification
- Performance comparison with other inference engines
🚧 Implementation Completion Needed
- Complete MoE implementation (routing, expert selection, load balancing)
- BPE Tokenizer implementation
- Generation loop (sampling strategies, beam search)
- Model configuration loading from HuggingFace config.json
📋 Platform & Optimization
- Metal backend for Apple Silicon
- CUDA backend for NVIDIA GPUs
- WebSocket streaming
- Model quantization (INT8, FP16)
- Flash Attention optimization
- Distributed inference
Validation Roadmap
Phase 1: Core Validation 🎯 NEXT PRIORITY
- Load Real Weights: Implement safetensors loading for actual DeepSeek V3 model
- Reference Testing: Compare outputs with HuggingFace transformers implementation
- Numerical Verification: Test attention patterns and layer outputs
- Simple Generation: Implement basic greedy decoding
Phase 2: Feature Completion
- Complete MoE: Implement expert routing and load balancing
- Full Tokenization: Add proper BPE tokenizer
- Advanced Sampling: Implement temperature, top-k, top-p sampling
- Performance Optimization: Profile and optimize bottlenecks
Phase 3: Production Readiness
- Comprehensive Testing: Unit tests, integration tests, benchmarks
- Cross-platform Support: Validate on different architectures
- GPU Acceleration: Complete Metal/CUDA backends
- Documentation: API docs, deployment guides
Architecture Decisions
Why MLA (Multi-Head Latent Attention)?
MLA is the key innovation that makes DeepSeek V3 more efficient than standard multi-head attention:
- Latent space compression: Projects KV to lower-dimensional latent space
- Shared computations: Reduces redundant key-value calculations
- Memory efficiency: Significantly lower memory footprint
- Maintained performance: No loss in model quality
Implementation Approach
Faithful to Paper: Our implementation closely follows the DeepSeek V3 paper architecture BLAS-Optimized: All linear operations use hardware-accelerated BLAS Memory Efficient: Proper tensor memory management and reuse Extensible: Clean interfaces for adding backends and optimizations
Contributing
This implementation provides a solid theoretical foundation for DeepSeek V3:
- Core Architecture: MLA attention and transformer layers architecturally complete
- Performance: BLAS acceleration working across operations
- Testing: Comprehensive test coverage for critical components
- Documentation: Well-documented APIs and architecture decisions
Critical Next Steps for Contributors:
- 🧪 Validation Testing: Load real weights and validate outputs
- 🔗 Model Loading: Complete safetensors/HuggingFace integration
- 📝 Tokenization: Implement proper BPE tokenizer
- 🎯 Generation: Add sampling strategies and inference pipeline
- 🧮 MoE Completion: Finish expert routing implementation
Development Setup
# Install Zig 0.15.0-dev
# https://ziglang.org/download/
# Clone repository
git clone [repository-url]
cd experimental/
# Run tests during development
/Users/triex/.local/share/zigup/0.15.0-dev.703+597dd328e/files/zig build test --watch
# Format code
/Users/triex/.local/share/zigup/0.15.0-dev.703+597dd328e/files/zig fmt src/
Performance Notes
Current Status: ✅ MLA attention architecturally implemented with BLAS acceleration - theoretical implementation functional.
Performance Results (Apple M1 MacBook Pro under heavy load):
- Matrix 256×256: 0.0ms/iter, 937 GFLOPS
- Matrix 512×512: 0.2ms/iter, 1143 GFLOPS
- Matrix 1024×1024: 2.2ms/iter, 977 GFLOPS
- Matrix 2048×2048: 20.9ms/iter, 823 GFLOPS
Performance Achievement: From 6418ms naive → 2.1ms BLAS = ~3000x speedup on matrix operations.
System Status:
- ✅ MLA Architecture: Complete theoretical implementation with latent projections, RoPE, and KV caching
- ✅ BLAS Backend: Apple Accelerate integration working optimally
- ✅ Peak Performance: 1143 GFLOPS measured (44% of theoretical maximum)
- ✅ Memory Bandwidth: 20.9 GB/s copying, well-optimized operations
- ✅ Hardware Detection: M-series Apple Silicon detection functional
⚠️ Performance Caveat: These are synthetic benchmarks. Real inference performance requires validation with actual model weights and end-to-end testing.
Known Limitations
- ⚠️ Theoretical Implementation: Architecture complete but unvalidated with real data
- Model Loading: Currently creates dummy models - real weight loading not implemented
- Tokenizer: Placeholder implementation - needs proper BPE tokenizer
- MoE Routing: Basic structure only - expert selection not implemented
- Output Validation: No comparison with reference implementations yet
- WebSocket: Basic structure only - streaming not implemented
- Metal/CUDA: Backend stubs only - GPU kernels not implemented
Is This Ready for Use?
No - this is a theoretical implementation that requires validation:
- What works now: ✅ Architecturally complete, compiles, runs, passes basic tests, excellent BLAS performance
- What's missing: Real weight loading, output validation, tokenization, generation pipeline
- Timeline: Architecture is theoretically complete, validation and testing is the next major milestone
Status: This provides a solid foundation for DeepSeek V3 implementation, but requires real-world validation before production use.
Comparison to Other Projects
Project | Language | Status | Focus | MLA Support |
---|---|---|---|---|
This | Zig | Architecture Complete (Theoretical) | Web-first inference | ✅ Architecturally Implemented |
llama.cpp | C++ | Production | CLI/library | ❌ No |
Candle | Rust | Production | ML framework | ❌ No |
ZML | Zig | Research | Low-level ML ops | ❌ No |
Unique advantages: First architectural implementation of MLA attention, built-in web server, Zig's zero-cost abstractions, single binary deployment.
⚡ Built with Zig for blazing fast DeepSeek V3 inference featuring Multi-Head Latent Attention!
Architecturally complete implementation of DeepSeek V3's core innovation - Multi-Head Latent Attention - ready for validation and testing.
📜 License
This implementation is dual-licensed:
- GPL-3.0: Free for open source projects
- Commercial: Contact Triex for proprietary use
See LICENSE-CODE and LICENSE-COMMERCIAL for details.