# DeepZig V3 Implementation 🚀
A high-performance implementation of DeepSeek V3 in [Zig](https://ziglang.org/) for blazingly fast inference.
> **⚠️ Status: Experimental Foundation**
>
> This project provides a **theoretical base foundation** for DeepZig V3 with a draft implementation:
> - ✅ **HTTP server** with OpenAI-compatible API
> - ✅ **SIMD-optimized tensor operations** (AVX2, NEON)
> - ✅ **Cross-platform build system** (Zig 0.15.0-dev)
> - ✅ **Memory management** and backend architecture
> - ✅ **Apple Silicon detection** via sysctl calls
>
> **Not yet implemented**: Full DeepSeek V3 model architecture, attention mechanisms, MoE routing.
> **Performance Note**: Current implementation uses naive algorithms - matrix multiplication is ~1000x slower than optimized BLAS. See [benchmarks](#benchmarks) below.
>
> See [Development Status](#development-status) for details.
## Overview
This experimental implementation aims to leverage Zig's unique advantages for systems programming to create a high-performance LLM inference engine:
- **Zero-cost abstractions** with compile-time optimization
- **Direct hardware access** for SIMD and platform-specific optimizations
- **Manual memory management** without garbage collection pauses
- **Single binary deployment** with no runtime dependencies
- **Cross-platform compilation** for multiple architectures
**📚 Related**: See the [main project README](../README.md) for architecture overview and vision.
## Project Structure
```
experimental/
├── build.zig                  # Build system configuration
├── build.zig.zon              # Package dependencies
├── src/
│   ├── main.zig               # HTTP server entry point
│   ├── core/                  # Core ML components
│   │   ├── root.zig           # Module exports
│   │   ├── tensor.zig         # SIMD-optimized tensors
│   │   ├── model.zig          # DeepSeek V3 model
│   │   ├── attention.zig      # MLA attention mechanism
│   │   ├── moe.zig            # Mixture of Experts
│   │   ├── tokenizer.zig      # Text tokenization
│   │   ├── backend.zig        # Backend abstraction
│   │   ├── memory.zig         # Memory management
│   │   └── math/              # Math utilities
│   │       ├── root.zig       # Math module exports
│   │       ├── simd.zig       # SIMD operations
│   │       ├── activation.zig # Activation functions
│   │       └── rms_norm.zig   # RMS normalization
│   ├── web/                   # HTTP API layer
│   │   ├── root.zig           # Web module exports
│   │   ├── server.zig         # HTTP server (std.http)
│   │   ├── handlers.zig       # Request handlers
│   │   ├── middleware.zig     # CORS, auth, rate limiting
│   │   ├── websocket.zig      # WebSocket support
│   │   ├── openai.zig         # OpenAI API compatibility
│   │   ├── request.zig        # Request wrapper
│   │   └── response.zig       # Response wrapper
│   ├── backends/              # Compute backends
│   │   ├── cpu/               # CPU with SIMD
│   │   ├── metal/             # Apple Silicon
│   │   └── cuda/              # NVIDIA GPUs
│   └── wasm/
│       └── main.zig           # WebAssembly entry point
├── bench/
│   └── main.zig               # Performance benchmarks
└── README.md                  # This file
```
## Requirements
- **Zig 0.15.0-dev**
- Platform-specific requirements:
- **macOS**: Xcode Command Line Tools (for Metal backend)
- **Linux**: CUDA Toolkit (for CUDA backend, optional)
- **Windows**: CUDA Toolkit (for CUDA backend, optional)
## Quick Start
### Building
```bash
# Clone and navigate to experimental directory
cd experimental/
# Build the project
zig build
# Run the server
zig build run
# Run tests
zig build test
# Run benchmarks
zig build bench
# Build WebAssembly
zig build wasm
```
### Running the Server
```bash
# Start server on default port (8080)
./zig-out/bin/deepseek-v3-zig
# Custom configuration
./zig-out/bin/deepseek-v3-zig --port 3000 --backend metal --model ./path/to/model
```
### API Usage
The server exposes OpenAI-compatible endpoints:
```bash
# Chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-v3",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
# Health check
curl http://localhost:8080/health
# Model info
curl http://localhost:8080/v1/models
```
## Performance Features
### SIMD Optimizations
- **x86_64**: AVX2/AVX-512 vectorization for matrix operations
- **ARM64**: NEON SIMD for Apple Silicon optimization
- **Auto-vectorization**: Compiler-optimized loops with `@Vector` types (see the sketch below)
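As an illustration of the `@Vector` approach (a minimal sketch, not the project's actual `tensor.zig` code), element-wise addition can be written once and lowered by the compiler to AVX2 or NEON as available:

```zig
const std = @import("std");

/// Minimal sketch: element-wise addition using Zig's built-in @Vector type.
/// Fixed-size arrays coerce to vectors, so the loop body compiles down to
/// wide SIMD instructions (one 256-bit add on AVX2) without intrinsics.
fn vecAdd(a: []const f32, b: []const f32, out: []f32) void {
    const Lane = @Vector(8, f32); // 8 x f32 = 256 bits, one AVX2 register
    var i: usize = 0;
    while (i + 8 <= a.len) : (i += 8) {
        const va: Lane = a[i..][0..8].*;
        const vb: Lane = b[i..][0..8].*;
        out[i..][0..8].* = va + vb; // vector add, coerced back to an array
    }
    while (i < a.len) : (i += 1) out[i] = a[i] + b[i]; // scalar tail
}
```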
### Backend Support
| Backend | Status | Features |
|---------|--------|----------|
| **CPU** | ✅ Implemented | Multi-threaded, SIMD, cache-optimized |
| **Metal** | 🚧 In Progress | Apple Silicon GPU, unified memory |
| **CUDA** | 🚧 Planned | NVIDIA GPU, Tensor Cores |
| **WebGPU** | 📋 Future | Browser GPU acceleration |
### Memory Management
- **Arena allocators** for request-scoped memory (see the sketch after this list)
- **Memory pools** for tensor allocations
- **Zero-copy operations** where possible
- **Cache-friendly** data layouts
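As a sketch of the arena pattern (a hypothetical handler, not the actual `memory.zig` API), every allocation made while serving a request comes from one arena and is released in a single call:

```zig
const std = @import("std");

/// Hypothetical request handler illustrating request-scoped arenas:
/// all allocations below share one arena and are freed together on
/// deinit, so there are no per-allocation frees and no GC pauses.
fn handleRequest(backing: std.mem.Allocator, body_len: usize) !void {
    var arena = std.heap.ArenaAllocator.init(backing);
    defer arena.deinit(); // one bulk free when the request finishes

    const alloc = arena.allocator();
    const scratch = try alloc.alloc(f32, body_len); // request-scoped buffer
    @memset(scratch, 0);
    // ... decode request, run inference, encode response ...
}
```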
## Development Status
### ✅ Drafted
- [x] Project structure and build system
- [x] Core tensor operations with SIMD
- [x] HTTP server with OpenAI API compatibility
- [x] CPU backend with optimizations
- [x] Memory management utilities
- [x] Benchmark suite
### 🚧 In Progress
- [ ] DeepSeek V3 model architecture
- [ ] Multi-Head Latent Attention (MLA)
- [ ] Mixture of Experts (MoE) implementation
- [ ] Metal backend for Apple Silicon
- [ ] Model loading and weight management
### 📋 Planned
- [ ] CUDA backend for NVIDIA GPUs
- [ ] WebSocket streaming
- [ ] Model quantization (INT8, FP16)
- [ ] Flash Attention optimization
- [ ] Distributed inference
- [ ] Advanced sampling strategies
## Architecture Decisions
### Why Zig?
1. **Performance**: Zero-cost abstractions without runtime overhead
2. **Memory Safety**: Compile-time memory management without GC
3. **Simplicity**: Single binary deployment, cross-compilation
4. **Control**: Direct hardware access for optimization
### Design Principles
- **Modularity**: Clean separation between core, web, and backend layers
- **Performance**: SIMD-first design with cache-friendly algorithms
- **Compatibility**: OpenAI API compatibility for easy adoption
- **Extensibility**: Plugin architecture for new backends (see the sketch below)
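One common way to express such a plugin architecture in Zig is the type-erased vtable pattern used by `std.mem.Allocator`; the sketch below is hypothetical and not the actual `backend.zig` interface:

```zig
/// Hypothetical backend interface (type-erased vtable). A new backend only
/// needs to supply a context pointer and a VTable; core code stays agnostic
/// to whether the implementation is CPU, Metal, or CUDA.
const Backend = struct {
    ptr: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        matmul: *const fn (ctx: *anyopaque, a: []const f32, b: []const f32, out: []f32, n: usize) void,
        name: *const fn (ctx: *anyopaque) []const u8,
    };

    pub fn matmul(self: Backend, a: []const f32, b: []const f32, out: []f32, n: usize) void {
        self.vtable.matmul(self.ptr, a, b, out, n); // dynamic dispatch
    }
};
```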
## Contributing
This is an experimental project! Contributions are welcome:
1. **Core ML**: Implement transformer layers, attention mechanisms
2. **Backends**: Optimize CUDA/Metal compute kernels
3. **Performance**: Profile and optimize bottlenecks
4. **Testing**: Add comprehensive test coverage
5. **Documentation**: Improve setup and usage guides
### Development Setup
```bash
# Install Zig 0.15.0-dev
# https://ziglang.org/download/
# Clone repository
git clone [repository-url]
cd experimental/
# Run tests during development
zig build test --watch
# Format code
zig fmt src/
```
## Benchmarks
Run benchmarks to measure performance:
```bash
zig build bench
```
Example output:
```
🚀 DeepZig V3 Performance Benchmarks
==========================================
Backend: CPU (SIMD optimized)
Architecture: x86_64
Thread count: 16
Operation | Iterations | Avg Time | Operations/s | Memory
-------------------------------|------------|-----------|--------------|-------
Tensor Creation (1024x1024) | 1000 iter | 2.03 ms | 493 ops/s | 4.0 MB
Tensor Addition (SIMD) | 100 iter | 1.49 ms | 2806962690 ops/s | 48.0 MB
Matrix Multiplication | 10 iter | 6418.08 ms | 0 GFLOPS | 12.0 MB
SwiGLU Activation | 1000 iter | 4.44 ms | 236002478 ops/s | 12.0 MB
RMS Normalization (SIMD) | 1000 iter | 0.00 ms | 1077586 ops/s | 0.0 MB
Memory Bandwidth | 100 iter | 4.92 ms | 13 ops/s | 128.0 MB
```
## Known Issues
- **Model Loading**: Currently creates dummy models - real weight loading not implemented
- **Tokenizer**: Placeholder implementation - needs proper BPE tokenizer
- **WebSocket**: Basic structure only - streaming not implemented
- **Metal/CUDA**: Backend stubs only - GPU kernels not implemented
## License
This experimental implementation follows the same license as the original DeepSeek V3 project.
## Resources
- [Original DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437)
- [Zig Language Documentation](https://ziglang.org/documentation/master/)
- [Zig Performance Guide](https://github.com/ziglang/zig/wiki/Performance)
- [SIMD in Zig](https://ziglang.org/documentation/master/#Vectors)
## Is This Ready for Production?
**No** - this is a research/development foundation, but it **compiles and runs**:
- **What works now**: ✅ Compiles and runs with Zig 0.15.0-dev; the HTTP server, tensor operations, SIMD math, and benchmarks all execute successfully
- **What's missing**: Optimized matrix operations and the actual DeepSeek V3 model implementation
- **Timeline**: The foundation compiles; model implementation is the next major milestone
## Comparison to Other Projects
| Project | Language | Status | Focus |
|---------|----------|--------|-------|
| **This** | Zig | Foundation + API | Web-first inference |
| llama.cpp | C++ | Production | CLI/library |
| Candle | Rust | Production | ML framework |
| ZML | Zig | Research | Low-level ML ops |
**Unique advantages**: Built-in web server, Zig's zero-cost abstractions, single binary deployment.
---
**⚡ Built with Zig for blazing fast LLM inference!**
## Performance Notes
**Current Status**: The implementation prioritizes initial **correctness and architecture** over performance. Key limitations:
- **Matrix Multiplication**: Uses a naive O(n³) algorithm (~640ms for 1024×1024) - needs BLAS optimization (see the sketch below)
- **Debug Builds**: Running in debug mode - release builds will be faster
- **No GPU Acceleration**: CPU-only implementation - GPU backends will provide major speedups
**Expected Optimizations**: 100-1000x speedup possible with optimized BLAS, release builds, and GPU backends.
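For reference, the naive algorithm in question looks roughly like the sketch below (illustrative only, not the exact code in `tensor.zig`); blocked, SIMD-vectorized, multi-threaded BLAS kernels improve on it by orders of magnitude:

```zig
/// Illustrative naive O(n³) matmul over row-major n x n matrices.
/// The inner loop strides through `b` column-wise, missing cache on
/// nearly every access - this is where the BLAS gap comes from.
fn matmulNaive(n: usize, a: []const f32, b: []const f32, c: []f32) void {
    for (0..n) |i| {
        for (0..n) |j| {
            var sum: f32 = 0;
            for (0..n) |k| sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```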