# DeepZig V3 Implementation 🚀

A high-performance implementation of DeepSeek V3 in [Zig](https://ziglang.org/) for blazingly fast inference.

> **⚠️ Status: Experimental Foundation**
>
> This project provides an **experimental foundation** for DeepZig V3 with a working draft implementation:
> - ✅ **HTTP server** with OpenAI-compatible API
> - ✅ **BLAS-accelerated tensor operations** (Apple Accelerate working)
> - ✅ **Cross-platform build system** (Zig 0.15.0-dev)
> - ✅ **Memory management** and backend architecture
> - ✅ **Apple Silicon detection and optimization**
> - ✅ **Functional matrix operations** (significant performance improvement)
>
> **Recent Progress**: Matrix operations now use BLAS acceleration<br/>
> **Performance Status**: 1160+ GFLOPS with the Apple Accelerate backend (measured on an Apple M1 MacBook Pro)<br/>
>
> See [Performance Results](#performance-notes) for detailed benchmarks.

## Overview

This experimental implementation aims to leverage Zig's unique advantages for systems programming to create a high-performance LLM inference engine:

- **Zero-cost abstractions** with compile-time optimization
- **Direct hardware access** for SIMD and platform-specific optimizations
- **Manual memory management** without garbage collection pauses
- **Single binary deployment** with no runtime dependencies
- **Cross-platform compilation** for multiple architectures

**🚀 BLAS Acceleration Achieved!** We've successfully integrated the Apple Accelerate backend, delivering **1000+ GFLOPS** - a ~**3000x speedup** over the initial naive implementation, measured on an M1 MacBook Pro.

**🔗 Related**: See the [main project README](../README.md) for architecture overview and vision.

## Project Structure

```
experimental/
├── build.zig                  # Build system configuration
├── build.zig.zon              # Package dependencies
├── src/
│   ├── main.zig               # HTTP server entry point
│   ├── core/                  # Core ML components
│   │   ├── root.zig           # Module exports
│   │   ├── tensor.zig         # SIMD-optimized tensors
│   │   ├── model.zig          # DeepSeek V3 model
│   │   ├── attention.zig      # MLA attention mechanism
│   │   ├── moe.zig            # Mixture of Experts
│   │   ├── tokenizer.zig      # Text tokenization
│   │   ├── backend.zig        # Backend abstraction
│   │   ├── memory.zig         # Memory management
│   │   └── math/              # Math utilities
│   │       ├── root.zig       # Math module exports
│   │       ├── simd.zig       # SIMD operations
│   │       ├── activation.zig # Activation functions
│   │       └── rms_norm.zig   # RMS normalization
│   ├── web/                   # HTTP API layer
│   │   ├── root.zig           # Web module exports
│   │   ├── server.zig         # HTTP server (std.http)
│   │   ├── handlers.zig       # Request handlers
│   │   ├── middleware.zig     # CORS, auth, rate limiting
│   │   ├── websocket.zig      # WebSocket support
│   │   ├── openai.zig         # OpenAI API compatibility
│   │   ├── request.zig        # Request wrapper
│   │   └── response.zig       # Response wrapper
│   ├── backends/              # Compute backends
│   │   ├── cpu/               # CPU with SIMD
│   │   ├── metal/             # Apple Silicon
│   │   └── cuda/              # NVIDIA GPUs
│   └── wasm/
│       └── main.zig           # WebAssembly entry point
├── bench/
│   └── main.zig               # Performance benchmarks
└── README.md                  # This file
```

## Requirements

- **Zig 0.15.0-dev**
- Platform-specific requirements:
  - **macOS**: Xcode Command Line Tools (for Metal backend)
  - **Linux**: CUDA Toolkit (for CUDA backend, optional)
  - **Windows**: CUDA Toolkit (for CUDA backend, optional)

## Quick Start

### Building

```bash
# Clone and navigate to experimental directory
cd experimental/

# Build the project
zig build

# Run the server
zig build run

# Run tests
zig build test

# Run benchmarks
zig build bench

# Build WebAssembly
zig build wasm
```

### Running the Server

```bash
# Start server on default port (8080)
./zig-out/bin/deepseek-v3-zig

# Custom configuration
./zig-out/bin/deepseek-v3-zig --port 3000 --backend metal --model ./path/to/model
```

### API Usage

The server exposes OpenAI-compatible endpoints:

```bash
# Chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

# Health check
curl http://localhost:8080/health

# Model info
curl http://localhost:8080/v1/models
```
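
A successful chat completion returns the standard OpenAI-style envelope. The following is an illustrative shape (not output captured from this server):

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "deepseek-v3",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello! How can I help you today?" },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 9, "completion_tokens": 10, "total_tokens": 19 }
}
```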

## Performance Features

### SIMD Optimizations

- **x86_64**: AVX2/AVX-512 vectorization for matrix operations
- **ARM64**: NEON SIMD for Apple Silicon optimization
- **Auto-vectorization**: Compiler-optimized loops with `@Vector` types, as sketched below
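
As a minimal standalone illustration of the `@Vector` approach (a sketch, not code taken from `src/core/math/simd.zig`):

```zig
const std = @import("std");

pub fn main() void {
    // 8-lane f32 vectors; Zig lowers these to SIMD registers
    // (NEON on ARM64, AVX on x86_64) when the target supports them.
    const a: @Vector(8, f32) = @splat(1.5);
    const b: @Vector(8, f32) = @splat(2.0);
    const c = a + b; // element-wise add in a single vector operation
    std.debug.print("lane 0 = {d}\n", .{c[0]});
}
```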

### Backend Support

| Backend | Status | Features |
|---------|--------|----------|
| **CPU** | ✅ Implemented | Multi-threaded, SIMD, cache-optimized |
| **Metal** | 🚧 In Progress | Apple Silicon GPU, unified memory |
| **CUDA** | 🚧 Planned | NVIDIA GPU, Tensor Cores |
| **WebGPU** | 📋 Future | Browser GPU acceleration |
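
One way such a backend abstraction can be expressed in Zig is a tagged union with compile-time dispatch. This is a hypothetical sketch; the names are illustrative and not the actual `src/core/backend.zig` API:

```zig
const std = @import("std");

const CpuBackend = struct {
    pub fn name(_: CpuBackend) []const u8 {
        return "cpu (multi-threaded SIMD)";
    }
};

const MetalBackend = struct {
    pub fn name(_: MetalBackend) []const u8 {
        return "metal (Apple Silicon GPU)";
    }
};

const Backend = union(enum) {
    cpu: CpuBackend,
    metal: MetalBackend,

    // `inline else` generates one branch per variant at compile time,
    // so dispatch needs no vtable and no heap allocation.
    pub fn name(self: Backend) []const u8 {
        return switch (self) {
            inline else => |impl| impl.name(),
        };
    }
};

pub fn main() void {
    const backend = Backend{ .cpu = .{} };
    std.debug.print("active backend: {s}\n", .{backend.name()});
}
```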

### Memory Management

- **Arena allocators** for request-scoped memory (see the sketch below)
- **Memory pools** for tensor allocations
- **Zero-copy operations** where possible
- **Cache-friendly** data layouts
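
A minimal sketch of the request-scoped arena pattern using the standard library (illustrative; the real `src/core/memory.zig` may differ):

```zig
const std = @import("std");

fn handleRequest(gpa: std.mem.Allocator) !void {
    // Every allocation made while serving this request comes from the
    // arena and is released in one shot when the request finishes.
    var arena = std.heap.ArenaAllocator.init(gpa);
    defer arena.deinit();
    const alloc = arena.allocator();

    const scratch = try alloc.alloc(f32, 4096);
    _ = scratch; // ... intermediate tensors, JSON parsing, etc.
    // No per-allocation frees needed: arena.deinit() reclaims everything.
}

pub fn main() !void {
    try handleRequest(std.heap.page_allocator);
}
```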

## Development Status

### ✅ Drafted

- [x] Project structure and build system
- [x] Core tensor operations with SIMD
- [x] HTTP server with OpenAI API compatibility
- [x] CPU backend with optimizations
- [x] Memory management utilities
- [x] Benchmark suite

### 🚧 In Progress

- [ ] DeepSeek V3 model architecture
- [ ] Multi-Head Latent Attention (MLA)
- [ ] Mixture of Experts (MoE) implementation
- [ ] Metal backend for Apple Silicon
- [ ] Model loading and weight management

### 📋 Planned

- [ ] CUDA backend for NVIDIA GPUs
- [ ] WebSocket streaming
- [ ] Model quantization (INT8, FP16)
- [ ] Flash Attention optimization
- [ ] Distributed inference
- [ ] Advanced sampling strategies

## Architecture Decisions

### Why Zig?

1. **Performance**: Zero-cost abstractions without runtime overhead
2. **Memory Safety**: Compile-time memory management without GC
3. **Simplicity**: Single binary deployment, cross-compilation
4. **Control**: Direct hardware access for optimization

### Design Principles

- **Modularity**: Clean separation between core, web, and backend layers
- **Performance**: SIMD-first design with cache-friendly algorithms
- **Compatibility**: OpenAI API compatibility for easy adoption
- **Extensibility**: Plugin architecture for new backends

## Contributing

This is an experimental project! Contributions are welcome:

1. **Core ML**: Implement transformer layers, attention mechanisms
2. **Backends**: Optimize CUDA/Metal compute kernels
3. **Performance**: Profile and optimize bottlenecks
4. **Testing**: Add comprehensive test coverage
5. **Documentation**: Improve setup and usage guides

### Development Setup

```bash
# Install Zig 0.15.0-dev
# https://ziglang.org/download/

# Clone repository
git clone [repository-url]
cd experimental/

# Run tests during development
zig build test --watch

# Format code
zig fmt src/
```

## Benchmarks

Run benchmarks to measure performance:

```bash
zig build bench
```

**Hardware Context**: Benchmarks were run on an Apple M1 MacBook Pro (MacBookPro17,1) with 16GB unified memory, Zig 0.15.0-dev.703+597dd328e, debug build.

Example output:

```
🚀 DeepZig V3 Performance Benchmarks
==========================================

🎯 DYNAMIC BENCHMARK SUMMARY
===============================

📊 Matrix Multiplication Performance:
  • 256×256: 0.0 ms, 937 GFLOPS
  • 512×512: 0.2 ms, 1084 GFLOPS
  • 1024×1024: 2.1 ms, 1164 GFLOPS
  • 2048×2048: 20.9 ms, 823 GFLOPS
  🏆 Peak measured: 1164 GFLOPS at 1024×1024

🧮 BLAS Configuration:
  • Backend: Apple Accelerate
  • Theoretical peak: 2600 GFLOPS (estimated)

➕ Tensor Operations:
  • SIMD Addition: 3.5 GB/s

💾 Memory Performance:
  • Copy Bandwidth: 20.9 GB/s
  • Random Access Latency: 1.8 ns

🎯 Performance Assessment:
  ✅ Acceptable: BLAS delivering 1000+ GFLOPS
  • Est. efficiency: 44% (vs theoretical peak)

Note: Benchmarked on Apple M1 MacBook Pro under heavy load
(should be significantly higher on a clean system).
```
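
For reference, the GFLOPS figures follow the standard matmul convention: an N×N×N multiply performs 2N³ floating-point operations, so GFLOPS = 2N³ / (time × 10⁹). For example, at 2048×2048, 2 × 2048³ ≈ 1.72 × 10¹⁰ FLOPs in 20.9 ms gives ≈ 823 GFLOPS, matching the output above.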

## Known Issues

- **Model Loading**: Currently creates dummy models - real weight loading not implemented
- **Tokenizer**: Placeholder implementation - needs proper BPE tokenizer
- **WebSocket**: Basic structure only - streaming not implemented
- **Metal/CUDA**: Backend stubs only - GPU kernels not implemented

## License

This experimental implementation follows the same license as the original DeepSeek V3 project.

## Resources

- [Original DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437)
- [Zig Language Documentation](https://ziglang.org/documentation/master/)
- [Zig Performance Guide](https://github.com/ziglang/zig/wiki/Performance)
- [SIMD in Zig](https://ziglang.org/documentation/master/#Vectors)

## Is This Ready for Production?

**No** - this is a research/development foundation. But the draft **compiles and runs**:

- **What works now**: ✅ Compiles and runs with Zig 0.15.0-dev; HTTP server, tensor operations, SIMD math, and benchmarks all execute successfully
- **What's missing**: The actual DeepSeek V3 model implementation (attention, MoE, weight loading) and further kernel optimization
- **Timeline**: The foundation compiles and runs; the model implementation is the next major milestone

## Comparison to Other Projects

| Project | Language | Status | Focus |
|---------|----------|--------|-------|
| **This** | Zig | Foundation + API | Web-first inference |
| llama.cpp | C++ | Production | CLI/library |
| Candle | Rust | Production | ML framework |
| ZML | Zig | Research | Low-level ML ops |

**Unique advantages**: Built-in web server, Zig's zero-cost abstractions, single binary deployment.

---

**⚡ Built with Zig for blazing fast LLM inference!**

## Performance Notes

**Current Status**: ✅ **BLAS integration working** - the Apple Accelerate backend is now functional in the draft implementation.

**Performance Results** (Apple M1 MacBook Pro under heavy load):

- **Matrix 256×256**: 0.0 ms/iter, **937 GFLOPS**
- **Matrix 512×512**: 0.2 ms/iter, **1084 GFLOPS**
- **Matrix 1024×1024**: 2.1 ms/iter, **1164 GFLOPS** (peak performance)
- **Matrix 2048×2048**: 20.9 ms/iter, **823 GFLOPS**

**Performance Achievement**: From **6418 ms** naive → **2.1 ms** BLAS = a ~**3000x speedup** on matrix operations.

**System Status**:

- ✅ **BLAS Backend**: Apple Accelerate integration working
- ✅ **Peak Performance**: **1164 GFLOPS measured** (44% of the estimated theoretical maximum)
- ✅ **Memory Bandwidth**: 20.9 GB/s copy bandwidth, well-optimized operations
- ✅ **Hardware Detection**: M-series Apple Silicon detection functional

**Next Steps**: Focus on transformer architecture, attention mechanisms, and model-specific optimizations for the draft DeepSeek V3 implementation.