# DeepZig V3: A High-Performance LLM Architecture

## Overview
A DRAFT proposal & foundation for implementing DeepSeek V3 in Zig to create a high-performance, web-ready LLM inference engine. This leverages Zig's unique advantages for systems programming while targeting modern deployment scenarios.
**⚠️ Status: EXPERIMENTAL DRAFT** ✅ Foundation compiles with Zig 0.15.0-dev, including:
- ✅ HTTP server framework (basic structure)
- ✅ SIMD-optimized tensor operations (draft implementation)
- ✅ Cross-platform backend architecture
- ✅ Initial memory management
- ✅ Apple Silicon M-series detection (hardware detection via sysctl)
- ✅ Comprehensive build system draft
- ✅ BLAS integration working (Apple Accelerate backend functional)
- ✅ Improved matrix operations (1000+ GFLOPS performance)
- ⚠️ NOT PRODUCTION READY - Draft implementation for research/development
**Performance Update**: ~~Current naive algorithms are ~1000x slower than optimized BLAS~~ BLAS integration now functional. Matrix multiplication: 2.1ms for 1024×1024 at 1000+ GFLOPS, a significant improvement over our initial naive implementation. See experimental benchmarks for detailed performance data.
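As a sanity check on those numbers: an n×n matrix multiplication performs roughly 2n³ floating-point operations, so the quoted time and throughput are mutually consistent:

$$2 \times 1024^3 \approx 2.15 \times 10^9 \ \text{FLOPs}, \qquad \frac{2.15 \times 10^9 \ \text{FLOPs}}{2.1\ \text{ms}} \approx 1.02\ \text{TFLOP/s} \approx 1020\ \text{GFLOPS}$$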
## Why This Matters
Current LLM inference is dominated by Python/PyTorch, which introduces:
- Garbage collection pauses during generation
- Runtime overhead from dynamic dispatch
- Complex deployment with heavy runtimes
- Platform lock-in due to dependency complexity
**Progress Update**: Our draft implementation now includes BLAS integration, delivering improved matrix-operation performance via the Apple Accelerate backend.
## Expected Benefits vs Current Reality

| Aspect | Current (PyTorch) | Target (Zig) | Current Achievement |
|---|---|---|---|
| Cold start | 10-30s | < 2s | Not measured |
| Memory usage | 20-40GB | < 16GB | 16GB+ for basic ops |
| Dependencies | ~2GB runtime | Single binary | ✅ Single binary |
| Deployment | Complex | Copy & run | ✅ Copy & run |
| Matrix Mul (1024×1024) | ~1ms (optimized) | < 1ms | ✅ 2.1ms (1000+ GFLOPS) |
See experimental benchmarks for current performance measurements.
## Why Zig?

**Performance**: Zero-cost abstractions, compile-time optimization, direct hardware access

**Simplicity**: Single static binary, no runtime dependencies, cross-compilation built-in

**Web-First**: Native HTTP server, WebAssembly compilation, efficient memory management
## Proposed Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Web Layer     │    │   Core Engine    │    │    Backends     │
│                 │    │                  │    │                 │
│ ├─ HTTP API     │◄──►│ ├─ Transformer   │◄──►│ ├─ CPU (SIMD)   │
│ ├─ WebSocket    │    │ ├─ Attention     │    │ ├─ Metal (macOS)│
│ ├─ Rate Limit   │    │ ├─ MoE Routing   │    │ ├─ CUDA (Linux) │
│ └─ Auth         │    │ └─ Tokenizer     │    │ └─ WebGPU       │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```
## Draft Web API Framework

### Planned Endpoints (Basic Structure Implemented)

- `POST /v1/chat/completions` - OpenAI-compatible chat API
- `POST /v1/completions` - Text completion
- `GET /v1/models` - List available models
- `GET /health` - Service health check
- `WebSocket /ws` - Streaming inference (planned)
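For a sense of the intended interface, a chat request against the OpenAI-compatible endpoint would look roughly like this once the server is running locally. The handler is still a draft, so the model name and the response behaviour shown here are assumptions for illustration, not implemented features:

```bash
# Illustrative only: "deepseek-v3" is a placeholder model name, and the
# draft server does not yet perform real inference behind this endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```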
## Deployment Vision

- **Static binaries** - Single file deployment, no dependencies
- **Direct VPS deployment** - Copy binary and run with systemd
- **Edge devices** - ARM/RISC-V cross-compilation
- **Serverless functions** - Minimal cold start with static linking
- **WebAssembly** - Browser inference without additional runtime
## Implementation Plan Status

### Phase 1: Foundation ✅ DRAFT COMPLETE
- Set up Zig project structure
- Implement basic tensor operations with SIMD
- Create memory management system (arena allocators)
- Build HTTP server framework
- Apple Silicon detection via sysctl calls
- Updated to Zig 0.15.0-dev - compiles cleanly
- Benchmark suite showing current performance
- BLAS integration working - Apple Accelerate backend functional
- Improved matrix performance - 1000+ GFLOPS operations
📈 Performance improvement achieved - BLAS acceleration now working
### Phase 2: Core Model (IN PROGRESS)
- Implement transformer layers
- Add Multi-Head Latent Attention (MLA)
- Build Mixture of Experts (MoE) routing
- Create tokenizer integration
### Phase 3: Backends (PLANNED)
- Optimize CPU backend with AVX/NEON
- Integrate Metal for Apple Silicon
- Add CUDA support for NVIDIA GPUs
- Implement WebGPU for browsers
### Phase 4: Web Integration (DRAFT STRUCTURE)
- Complete HTTP API implementation (basic structure)
- Add WebSocket streaming
- Build authentication/rate limiting
- Create deployment tooling
## Technical Challenges

- **Model Complexity**: DeepSeek V3's MoE architecture requires careful memory management
- **Backend Integration**: Need efficient FFI to CUDA/Metal while maintaining performance
- **Web Scale**: Handle concurrent requests without blocking inference
- **Accuracy**: Match PyTorch numerical precision
- **Performance**: Matrix operations now use BLAS acceleration; focus shifts to model architecture optimization
## Platform-Specific Opportunities

### Apple Silicon (M-Series) ✅ Draft Detection Implemented

- **Metal Performance Shaders** integration for matrix operations
- **AMX instruction set** access for accelerated linear algebra
- **Unified memory architecture** exploitation for zero-copy transfers
- **Power efficiency tuning** across P and E cores
- ✅ **Proper M1/M2/M3/M4 detection** via system calls

**Current status**: Hardware detection working, GPU acceleration not yet implemented.
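The detection path can be sketched in a few lines of Zig. This is a minimal standalone illustration assuming macOS with libc linked (`-lc`); it is not the repository's `src/test_m_series.zig`:

```zig
const std = @import("std");
// Pull in the libc sysctl interface (macOS); requires linking libc.
const c = @cImport(@cInclude("sys/sysctl.h"));

/// Read the CPU brand string, e.g. "Apple M2 Pro".
fn cpuBrand(buf: []u8) ![]const u8 {
    var len: usize = buf.len;
    if (c.sysctlbyname("machdep.cpu.brand_string", buf.ptr, &len, null, 0) != 0)
        return error.SysctlFailed;
    if (len == 0) return error.SysctlFailed;
    return buf[0 .. len - 1]; // len includes the trailing NUL
}

pub fn main() !void {
    var buf: [128]u8 = undefined;
    const brand = try cpuBrand(&buf);
    // M-series chips report a brand string starting with "Apple M".
    const is_m_series = std.mem.startsWith(u8, brand, "Apple M");
    std.debug.print("CPU: {s} (M-series: {})\n", .{ brand, is_m_series });
}
```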
### x86_64 Architecture

- **AVX-512 vectorization** with masked operations
- **Cache-friendly memory layouts** for L1/L2/L3 optimization
- **NUMA-aware allocation** and thread assignment
- **Dynamic dispatch** based on runtime CPU feature detection (see the sketch below)
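For illustration, Zig exposes the target's feature set directly. The sketch below gates on compile-time features, whereas the runtime dispatch named above would additionally probe CPUID at startup and select a kernel; nothing here is taken from the repository:

```zig
const std = @import("std");
const builtin = @import("builtin");

pub fn main() void {
    // Query the compile target's feature set. A runtime-dispatch design
    // would instead probe CPUID once at startup and pick a kernel pointer.
    if (builtin.cpu.arch == .x86_64) {
        const features = builtin.cpu.features;
        std.debug.print("avx2:    {}\n", .{std.Target.x86.featureSetHas(features, .avx2)});
        std.debug.print("avx512f: {}\n", .{std.Target.x86.featureSetHas(features, .avx512f)});
    } else {
        std.debug.print("compile target is not x86_64\n", .{});
    }
}
```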
### NVIDIA GPUs

- **CUDA integration** via efficient FFI bindings (illustrated below)
- **Tensor Core utilization** for mixed-precision operations
- **Custom kernels** for attention mechanisms
- **Memory pooling** for reduced allocation overhead
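As a taste of what the FFI surface could look like, this sketch queries the visible device count through the CUDA runtime API. It assumes a CUDA toolkit on the include path with `libcudart` linked, and is not code from this repository:

```zig
const std = @import("std");
// Assumes CUDA headers are on the include path and libcudart is linked.
const cu = @cImport(@cInclude("cuda_runtime.h"));

pub fn main() void {
    var count: c_int = 0;
    // cudaGetDeviceCount is part of the CUDA runtime API (libcudart).
    const rc = cu.cudaGetDeviceCount(&count);
    if (rc != cu.cudaSuccess) {
        std.debug.print("CUDA unavailable (error {d})\n", .{rc});
        return;
    }
    std.debug.print("CUDA devices visible: {d}\n", .{count});
}
```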
## Getting Started

**Current Status**: This repository contains a DRAFT EXPERIMENTAL Zig implementation foundation.

**For the Current Zig Implementation:**
```bash
# Clone this repository
git clone https://github.com/Triex/DeepZig-V3
cd DeepZig-V3/experimental

# Build and test the foundation
zig build

# Run the HTTP server (basic structure)
zig build run -- --port 8080

# Run benchmarks (see actual performance)
zig build bench

# Test Apple Silicon detection
zig build-exe src/test_m_series.zig -I src -lc -framework Metal -framework Foundation
./test_m_series
```
**📊 Performance Reality Check**: See experimental/README.md for actual benchmark results showing current performance limitations and optimization opportunities.
## Development Approach

Following established Zig patterns:

- **Arena allocators** for request-scoped memory
- **Error unions** for explicit error handling
- **Comptime generics** for zero-cost abstractions
- **SIMD vectors** for numerical computation
Reference: Zig Cookbook for implementation patterns.
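A minimal, self-contained sketch of how several of these patterns compose (illustrative only, not code from this repository): an arena owns all request-scoped buffers, `!`-typed error unions surface allocation failures explicitly, and `@Vector` drives the SIMD inner loop.

```zig
const std = @import("std");

// SIMD dot product over f32 slices using Zig's built-in vector type.
fn dotSimd(a: []const f32, b: []const f32) f32 {
    const Vec = @Vector(8, f32);
    var acc: Vec = @splat(0.0);
    var i: usize = 0;
    while (i + 8 <= a.len) : (i += 8) {
        const va: Vec = a[i..][0..8].*;
        const vb: Vec = b[i..][0..8].*;
        acc += va * vb; // elementwise multiply-accumulate
    }
    var sum = @reduce(.Add, acc);
    while (i < a.len) : (i += 1) sum += a[i] * b[i]; // scalar tail
    return sum;
}

pub fn main() !void {
    // Arena allocator: everything allocated for this "request" frees at once.
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const alloc = arena.allocator();

    // `try` propagates the allocator's error union explicitly.
    const a = try alloc.alloc(f32, 1024);
    const b = try alloc.alloc(f32, 1024);
    for (a, 0..) |*x, i| x.* = @floatFromInt(i % 7);
    for (b, 0..) |*x, i| x.* = @floatFromInt(i % 5);

    std.debug.print("dot = {d}\n", .{dotSimd(a, b)});
}
```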
## Seeking Contributors
This is an ambitious DRAFT project that would benefit from expertise in:
- Performance optimization (focus on transformer and attention mechanisms)
- Zig systems programming
- GPU kernel optimization (CUDA/Metal)
- ML model implementation
- Web server development
- Hardware-software co-design
- Novel inference techniques (Speculative decoding, quantization)
## Current Limitations & Next Steps

🚧 **What's Working**: ✅ Compiles, runs, BLAS acceleration functional

⚠️ **What's Missing**: Robust flows, actual DeepSeek V3 model implementation

📊 **Performance Status**: ✅ Matrix operations improved (BLAS working)

🎯 **Next Priority**: DeepSeek V3 transformer architecture and attention mechanisms
See experimental implementation for technical details and current benchmarks.
## References
- DeepZig V3 (Experimental Implementation) - Current working code
- DeepSeek V3 Paper - Original model architecture
- Zig Language - Language documentation
- Awesome Zig - Community resources
- Zig Patterns - Common idioms
- ZML - Zig Inference Stack
- LLaMA.cpp - C++ Inference Engine
- DeepZig Consciousness - Research goal/end game
**Status**: 🎯 EXPERIMENTAL DRAFT - Foundation compiles and runs basic operations (see benchmarks)

**Vision**: Foundation for advanced AI reasoning research

**⚠️ Important**: This is a research/development foundation with draft/base implementations. Not ready for production use.