DeepZig V3 Implementation 🚀
A high-performance implementation of DeepSeek V3 in Zig for blazingly fast inference.
⚠️ Status: Experimental Foundation
This project provides a base foundation for DeepSeek V3 in Zig with:
- ✅ Working HTTP server with OpenAI-compatible API
- ✅ SIMD-optimized tensor operations (AVX2, NEON)
- ✅ Cross-platform build system (Zig 0.15.0-dev)
- ✅ Memory management and backend architecture
Not yet implemented: Full DeepSeek V3 model architecture, attention mechanisms, MoE routing. See Development Status for details.
Overview
This experimental implementation aims to leverage Zig's unique advantages for systems programming to create a high-performance LLM inference engine:
- Zero-cost abstractions with compile-time optimization
- Direct hardware access for SIMD and platform-specific optimizations
- Manual memory management without garbage collection pauses
- Single binary deployment with no runtime dependencies
- Cross-platform compilation for multiple architectures
Project Structure
experimental/
├── build.zig # Build system configuration
├── build.zig.zon # Package dependencies
├── src/
│ ├── main.zig # HTTP server entry point
│ ├── core/ # Core ML components
│ │ ├── root.zig # Module exports
│ │ ├── tensor.zig # SIMD-optimized tensors
│ │ ├── model.zig # DeepSeek V3 model
│ │ ├── attention.zig # MLA attention mechanism
│ │ ├── moe.zig # Mixture of Experts
│ │ ├── tokenizer.zig # Text tokenization
│ │ ├── backend.zig # Backend abstraction
│ │ ├── memory.zig # Memory management
│ │ └── math/ # Math utilities
│ │ ├── root.zig # Math module exports
│ │ ├── simd.zig # SIMD operations
│ │ ├── activation.zig # Activation functions
│ │ └── rms_norm.zig # RMS normalization
│ ├── web/ # HTTP API layer
│ │ ├── root.zig # Web module exports
│ │ ├── server.zig # HTTP server (std.http)
│ │ ├── handlers.zig # Request handlers
│ │ ├── middleware.zig # CORS, auth, rate limiting
│ │ ├── websocket.zig # WebSocket support
│ │ ├── openai.zig # OpenAI API compatibility
│ │ ├── request.zig # Request wrapper
│ │ └── response.zig # Response wrapper
│ ├── backends/ # Compute backends
│ │ ├── cpu/ # CPU with SIMD
│ │ ├── metal/ # Apple Silicon
│ │ └── cuda/ # NVIDIA GPUs
│ └── wasm/
│ └── main.zig # WebAssembly entry point
├── bench/
│ └── main.zig # Performance benchmarks
└── README.md # This file
Requirements
- Zig 0.15.0-dev or later
- Platform-specific requirements:
- macOS: Xcode Command Line Tools (for Metal backend)
- Linux: CUDA Toolkit (for CUDA backend, optional)
- Windows: CUDA Toolkit (for CUDA backend, optional)
Quick Start
Building
# Clone and navigate to experimental directory
cd experimental/
# Build the project
zig build
# Run the server
zig build run
# Run tests
zig build test
# Run benchmarks
zig build bench
# Build WebAssembly
zig build wasm
Running the Server
# Start server on default port (8080)
./zig-out/bin/deepseek-v3-zig
# Custom configuration
./zig-out/bin/deepseek-v3-zig --port 3000 --backend metal --model ./path/to/model
API Usage
The server exposes OpenAI-compatible endpoints:
# Chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-v3",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
# Health check
curl http://localhost:8080/health
# Model info
curl http://localhost:8080/v1/models
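For reference, this is roughly how a handler could model and parse the request body shown above with `std.json`. The struct names (`ChatMessage`, `ChatRequest`) and the helper function are illustrative, not the actual types in `handlers.zig`:

```zig
const std = @import("std");

// Illustrative shape of the chat completion request body above.
// Field names follow the OpenAI API; these types are hypothetical.
const ChatMessage = struct {
    role: []const u8,
    content: []const u8,
};

const ChatRequest = struct {
    model: []const u8,
    messages: []ChatMessage,
    max_tokens: ?u32 = null,
};

fn parseChatRequest(allocator: std.mem.Allocator, body: []const u8) !std.json.Parsed(ChatRequest) {
    // Caller owns the returned Parsed value and must call .deinit() on it.
    return std.json.parseFromSlice(ChatRequest, allocator, body, .{
        .ignore_unknown_fields = true,
    });
}
```

Setting `ignore_unknown_fields` keeps the handler tolerant of optional OpenAI fields (e.g. `temperature`, `stream`) that the server may not use yet.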
Performance Features
SIMD Optimizations
- x86_64: AVX2/AVX-512 vectorization for matrix operations
- ARM64: NEON SIMD for Apple Silicon optimization
- Auto-vectorization: Compiler-optimized loops with `@Vector` types (see the sketch below)
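As a rough illustration of the `@Vector` approach (a simplified sketch, not the actual code in `simd.zig`), element-wise addition over `f32` slices can be written so the compiler emits AVX2 or NEON instructions for the target:

```zig
const std = @import("std");

/// Element-wise addition of two f32 slices using 8-wide vectors,
/// with a scalar loop for the remaining tail elements.
fn addSimd(a: []const f32, b: []const f32, out: []f32) void {
    std.debug.assert(a.len == b.len and a.len == out.len);
    const Vec = @Vector(8, f32);
    var i: usize = 0;
    while (i + 8 <= a.len) : (i += 8) {
        const va: Vec = a[i..][0..8].*;
        const vb: Vec = b[i..][0..8].*;
        out[i..][0..8].* = va + vb;
    }
    while (i < a.len) : (i += 1) out[i] = a[i] + b[i];
}
```

The `va + vb` expression is lowered to platform SIMD instructions for the chosen target, so the same source covers both x86_64 and ARM64.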
Backend Support
| Backend | Status | Features |
|---------|--------|----------|
| CPU | ✅ Implemented | Multi-threaded, SIMD, cache-optimized |
| Metal | 🚧 In Progress | Apple Silicon GPU, unified memory |
| CUDA | 🚧 Planned | NVIDIA GPU, Tensor Cores |
| WebGPU | 📋 Future | Browser GPU acceleration |
Memory Management
- Arena allocators for request-scoped memory (see the sketch below)
- Memory pools for tensor allocations
- Zero-copy operations where possible
- Cache-friendly data layouts
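A minimal sketch of the request-scoped arena pattern (the function and its body are illustrative, not the actual server code):

```zig
const std = @import("std");

fn handleRequest(gpa: std.mem.Allocator, body: []const u8) !void {
    // Everything allocated while handling this request comes from the arena
    // and is freed in a single call when the request completes.
    var arena = std.heap.ArenaAllocator.init(gpa);
    defer arena.deinit();

    const alloc = arena.allocator();
    const scratch = try alloc.alloc(u8, body.len);
    @memcpy(scratch, body);
    // ... tokenize, run inference, and build the response using `alloc` ...
}
```

Because the arena is torn down once per request, nothing inside the handler needs to be freed individually.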
Development Status
✅ Drafted
- Project structure and build system
- Core tensor operations with SIMD
- HTTP server with OpenAI API compatibility
- CPU backend with optimizations
- Memory management utilities
- Benchmark suite
🚧 In Progress
- DeepSeek V3 model architecture
- Multi-Head Latent Attention (MLA)
- Mixture of Experts (MoE) implementation
- Metal backend for Apple Silicon
- Model loading and weight management
📋 Planned
- CUDA backend for NVIDIA GPUs
- WebSocket streaming
- Model quantization (INT8, FP16)
- Flash Attention optimization
- Distributed inference
- Advanced sampling strategies
Architecture Decisions
Why Zig?
- Performance: Zero-cost abstractions without runtime overhead
- Memory Safety: Compile-time memory management without GC
- Simplicity: Single binary deployment, cross-compilation
- Control: Direct hardware access for optimization
Design Principles
- Modularity: Clean separation between core, web, and backend layers
- Performance: SIMD-first design with cache-friendly algorithms
- Compatibility: OpenAI API compatibility for easy adoption
- Extensibility: Plugin architecture for new backends (see the interface sketch below)
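One way to express such a plugin architecture in Zig is a small vtable of function pointers per backend. The sketch below is hypothetical and not the actual interface in `backend.zig`:

```zig
/// Hypothetical backend interface: each compute backend (CPU, Metal, CUDA)
/// supplies its own implementations of the required kernels.
const Backend = struct {
    name: []const u8,
    matmul: *const fn (a: []const f32, b: []const f32, out: []f32, m: usize, k: usize, n: usize) void,
};

fn cpuMatmul(a: []const f32, b: []const f32, out: []f32, m: usize, k: usize, n: usize) void {
    // Naive reference kernel; a real CPU backend would tile and vectorize.
    for (0..m) |i| {
        for (0..n) |j| {
            var acc: f32 = 0;
            for (0..k) |p| acc += a[i * k + p] * b[p * n + j];
            out[i * n + j] = acc;
        }
    }
}

const cpu_backend = Backend{ .name = "cpu", .matmul = &cpuMatmul };
```

A Metal or CUDA backend would supply its own `matmul` (and other kernels) while the rest of the engine stays backend-agnostic.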
Contributing
This is an experimental project! Contributions are welcome:
- Core ML: Implement transformer layers, attention mechanisms
- Backends: Optimize CUDA/Metal compute kernels
- Performance: Profile and optimize bottlenecks
- Testing: Add comprehensive test coverage
- Documentation: Improve setup and usage guides
Development Setup
# Install Zig 0.15.0-dev
# https://ziglang.org/download/
# Clone repository
git clone [repository-url]
cd experimental/
# Run tests during development
zig build test --watch
# Format code
zig fmt src/
Benchmarks
Run benchmarks to measure performance:
zig build bench
Example output:
🚀 DeepZig V3 Performance Benchmarks
==========================================
Backend: CPU (SIMD optimized)
Architecture: x86_64
Thread count: 16
Operation | Iterations | Avg Time | Operations/s | Memory
-------------------------------|------------|-----------|--------------|-------
Tensor Creation (1024x1024) | 1000 iter | 0.05 ms | 20000000 ops/s | 4.0 MB
Tensor Addition (SIMD) | 100 iter | 0.12 ms | 35000000000 ops/s | 48.0 MB
Matrix Multiplication | 10 iter | 125.30 ms | 17.2 GFLOPS | 12.0 MB
Known Issues
- Model Loading: Currently creates dummy models - real weight loading not implemented
- Tokenizer: Placeholder implementation - needs proper BPE tokenizer
- WebSocket: Basic structure only - streaming not implemented
- Metal/CUDA: Backend stubs only - GPU kernels not implemented
License
This experimental implementation follows the same license as the original DeepSeek V3 project.
Resources
Is This Ready for Production?
No - this is a research/development foundation. The code is experimental, but it compiles and runs:
- What works now: ✅ Compiles with Zig 0.15.0-dev; HTTP server, tensor math, SIMD operations, benchmarks, and the backend architecture are in place
- What's missing: The actual DeepSeek V3 model implementation (attention, MoE routing, weight loading)
- Timeline: The foundation compiles and runs; the model implementation is the next major milestone
Comparison to Other Projects
| Project | Language | Status | Focus |
|---------|----------|--------|-------|
| This | Zig | Foundation + API | Web-first inference |
| llama.cpp | C++ | Production | CLI/library |
| Candle | Rust | Production | ML framework |
| ZML | Zig | Research | Low-level ML ops |
Unique advantages: Built-in web server, Zig's zero-cost abstractions, single binary deployment.
⚡ Built with Zig for blazing fast LLM inference!