# DeepZig V3 Implementation 🚀
A high-performance implementation of DeepSeek V3 in [Zig](https://ziglang.org/) for blazingly fast inference.
> **⚠️ Status: Experimental Foundation**
>
> This project provides a **base foundation** for DeepSeek V3 in Zig with:
>
> - ✅ **Working HTTP server** with OpenAI-compatible API
> - ✅ **SIMD-optimized tensor operations** (AVX2, NEON)
> - ✅ **Cross-platform build system** (Zig 0.15.0-dev)
> - ✅ **Memory management** and backend architecture
>
> **Not yet implemented**: the full DeepSeek V3 model architecture, attention mechanisms, and MoE routing.
> See [Development Status](#development-status) for details.

## Overview

This experimental implementation aims to leverage Zig's unique advantages for systems programming to create a high-performance LLM inference engine:

- **Zero-cost abstractions** with compile-time optimization (see the sketch after this list)
- **Direct hardware access** for SIMD and platform-specific optimizations
- **Manual memory management** without garbage collection pauses
- **Single binary deployment** with no runtime dependencies
- **Cross-platform compilation** for multiple architectures

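As a small illustration of the first point, `comptime` type parameters let a numeric kernel be specialized and monomorphized at compile time, with no runtime dispatch. A minimal sketch (illustrative only, not code from this repository):

```zig
const std = @import("std");

/// Dot product generic over the element type. Each `T` used at a call
/// site is stamped out as its own fully specialized machine code at
/// compile time; there is no boxing or virtual dispatch at runtime.
fn dot(comptime T: type, a: []const T, b: []const T) T {
    std.debug.assert(a.len == b.len);
    var sum: T = 0;
    for (a, b) |x, y| sum += x * y;
    return sum;
}

test "dot" {
    const a = [_]f32{ 1.0, 2.0, 3.0 };
    const b = [_]f32{ 4.0, 5.0, 6.0 };
    try std.testing.expectEqual(@as(f32, 32.0), dot(f32, &a, &b));
}
```
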
## Project Structure

```
experimental/
├── build.zig                  # Build system configuration
├── build.zig.zon              # Package dependencies
├── src/
│   ├── main.zig               # HTTP server entry point
│   ├── core/                  # Core ML components
│   │   ├── root.zig           # Module exports
│   │   ├── tensor.zig         # SIMD-optimized tensors
│   │   ├── model.zig          # DeepSeek V3 model
│   │   ├── attention.zig      # MLA attention mechanism
│   │   ├── moe.zig            # Mixture of Experts
│   │   ├── tokenizer.zig      # Text tokenization
│   │   ├── backend.zig        # Backend abstraction
│   │   ├── memory.zig         # Memory management
│   │   └── math/              # Math utilities
│   │       ├── root.zig       # Math module exports
│   │       ├── simd.zig       # SIMD operations
│   │       ├── activation.zig # Activation functions
│   │       └── rms_norm.zig   # RMS normalization
│   ├── web/                   # HTTP API layer
│   │   ├── root.zig           # Web module exports
│   │   ├── server.zig         # HTTP server (std.http)
│   │   ├── handlers.zig       # Request handlers
│   │   ├── middleware.zig     # CORS, auth, rate limiting
│   │   ├── websocket.zig      # WebSocket support
│   │   ├── openai.zig         # OpenAI API compatibility
│   │   ├── request.zig        # Request wrapper
│   │   └── response.zig       # Response wrapper
│   ├── backends/              # Compute backends
│   │   ├── cpu/               # CPU with SIMD
│   │   ├── metal/             # Apple Silicon
│   │   └── cuda/              # NVIDIA GPUs
│   └── wasm/
│       └── main.zig           # WebAssembly entry point
├── bench/
│   └── main.zig               # Performance benchmarks
└── README.md                  # This file
```

## Requirements

- **Zig 0.15.0-dev** or later
- Platform-specific requirements:
  - **macOS**: Xcode Command Line Tools (for Metal backend)
  - **Linux**: CUDA Toolkit (for CUDA backend, optional)
  - **Windows**: CUDA Toolkit (for CUDA backend, optional)

## Quick Start

### Building

```bash
# Clone and navigate to the experimental directory
cd experimental/

# Build the project
zig build

# Run the server
zig build run

# Run tests
zig build test

# Run benchmarks
zig build bench

# Build WebAssembly
zig build wasm
```

### Running the Server

```bash
# Start server on default port (8080)
./zig-out/bin/deepseek-v3-zig

# Custom configuration
./zig-out/bin/deepseek-v3-zig --port 3000 --backend metal --model ./path/to/model
```

### API Usage

The server exposes OpenAI-compatible endpoints:

```bash
# Chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

# Health check
curl http://localhost:8080/health

# Model info
curl http://localhost:8080/v1/models
```

## Performance Features

### SIMD Optimizations

- **x86_64**: AVX2/AVX-512 vectorization for matrix operations
- **ARM64**: NEON SIMD for Apple Silicon optimization
- **Auto-vectorization**: compiler-optimized loops built on `@Vector` types (sketched after this list)

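A minimal sketch of the `@Vector` pattern these optimizations build on (illustrative; the actual `simd.zig` may be organized differently):

```zig
const std = @import("std");

/// Element-wise addition of two f32 slices: 8 lanes at a time via
/// @Vector, then a scalar loop for the leftover tail elements.
fn addF32(dst: []f32, a: []const f32, b: []const f32) void {
    std.debug.assert(a.len == b.len and dst.len == a.len);
    const lanes = 8;
    const Vec = @Vector(lanes, f32);

    var i: usize = 0;
    while (i + lanes <= a.len) : (i += lanes) {
        const va: Vec = a[i..][0..lanes].*; // array -> vector coercion
        const vb: Vec = b[i..][0..lanes].*;
        dst[i..][0..lanes].* = va + vb; // one SIMD add per 8 floats
    }
    while (i < a.len) : (i += 1) dst[i] = a[i] + b[i];
}
```
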
### Backend Support

| Backend | Status | Features |
|---------|--------|----------|
| **CPU** | ✅ Implemented | Multi-threaded, SIMD, cache-optimized |
| **Metal** | 🚧 In Progress | Apple Silicon GPU, unified memory |
| **CUDA** | 🚧 Planned | NVIDIA GPU, Tensor Cores |
| **WebGPU** | 📋 Future | Browser GPU acceleration |

### Memory Management

- **Arena allocators** for request-scoped memory (see the sketch after this list)
- **Memory pools** for tensor allocations
- **Zero-copy operations** where possible
- **Cache-friendly** data layouts

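Request-scoped allocation with `std.heap.ArenaAllocator` looks roughly like this (a sketch with a hypothetical `handleRequest`; the actual server code may differ):

```zig
const std = @import("std");

// Hypothetical request handler: every allocation made while serving one
// request comes from the arena and is released by a single deinit, so the
// hot path has no per-object free bookkeeping.
fn handleRequest(gpa: std.mem.Allocator, body_len: usize) !void {
    var arena = std.heap.ArenaAllocator.init(gpa);
    defer arena.deinit(); // frees all request-scoped memory at once

    const scratch = try arena.allocator().alloc(u8, body_len);
    _ = scratch; // ... parse request, run inference, build response ...
}
```
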
## Development Status

### ✅ Drafted

- [x] Project structure and build system
- [x] Core tensor operations with SIMD
- [x] HTTP server with OpenAI API compatibility
- [x] CPU backend with optimizations
- [x] Memory management utilities
- [x] Benchmark suite

### 🚧 In Progress

- [ ] DeepSeek V3 model architecture
- [ ] Multi-Head Latent Attention (MLA)
- [ ] Mixture of Experts (MoE) implementation
- [ ] Metal backend for Apple Silicon
- [ ] Model loading and weight management

### 📋 Planned

- [ ] CUDA backend for NVIDIA GPUs
- [ ] WebSocket streaming
- [ ] Model quantization (INT8, FP16)
- [ ] Flash Attention optimization
- [ ] Distributed inference
- [ ] Advanced sampling strategies

## Architecture Decisions

### Why Zig?

1. **Performance**: Zero-cost abstractions without runtime overhead
2. **Memory Safety**: Compile-time memory management without GC
3. **Simplicity**: Single binary deployment, cross-compilation
4. **Control**: Direct hardware access for optimization

### Design Principles

- **Modularity**: Clean separation between core, web, and backend layers
- **Performance**: SIMD-first design with cache-friendly algorithms
- **Compatibility**: OpenAI API compatibility for easy adoption
- **Extensibility**: Plugin architecture for new backends (see the sketch after this list)

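One idiomatic way to express such a pluggable backend in Zig is a type-erased pointer plus a vtable; a minimal sketch with illustrative names (the actual `backend.zig` may be structured differently):

```zig
/// A backend "interface": concrete backends (CPU, Metal, CUDA) provide a
/// pointer to their state plus a vtable of function pointers. Callers go
/// through the wrapper methods and never see the concrete type.
const Backend = struct {
    ptr: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        matmul: *const fn (ptr: *anyopaque, a: []const f32, b: []const f32, out: []f32) void,
        name: *const fn (ptr: *anyopaque) []const u8,
    };

    pub fn matmul(self: Backend, a: []const f32, b: []const f32, out: []f32) void {
        self.vtable.matmul(self.ptr, a, b, out);
    }

    pub fn name(self: Backend) []const u8 {
        return self.vtable.name(self.ptr);
    }
};
```
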
## Contributing

This is an experimental project! Contributions are welcome:

1. **Core ML**: Implement transformer layers, attention mechanisms
2. **Backends**: Optimize CUDA/Metal compute kernels
3. **Performance**: Profile and optimize bottlenecks
4. **Testing**: Add comprehensive test coverage
5. **Documentation**: Improve setup and usage guides

### Development Setup

```bash
# Install Zig 0.15.0-dev
# https://ziglang.org/download/

# Clone repository
git clone [repository-url]
cd experimental/

# Run tests during development
zig build test --watch

# Format code
zig fmt src/
```

## Benchmarks

Run benchmarks to measure performance:

```bash
zig build bench
```

Example output:

```
🚀 DeepZig V3 Performance Benchmarks
==========================================

Backend: CPU (SIMD optimized)
Architecture: x86_64
Thread count: 16

Operation                      | Iterations | Avg Time  | Operations/s       | Memory
-------------------------------|------------|-----------|--------------------|-------
Tensor Creation (1024x1024)    | 1000 iter  | 0.05 ms   | 20000000 ops/s     | 4.0 MB
Tensor Addition (SIMD)         | 100 iter   | 0.12 ms   | 35000000000 ops/s  | 48.0 MB
Matrix Multiplication          | 10 iter    | 125.30 ms | 17.2 GFLOPS        | 12.0 MB
```

## Known Issues

- **Model Loading**: Currently creates dummy models; real weight loading is not implemented
- **Tokenizer**: Placeholder implementation; needs a proper BPE tokenizer
- **WebSocket**: Basic structure only; streaming is not implemented
- **Metal/CUDA**: Backend stubs only; GPU kernels are not implemented

## License

This experimental implementation follows the same license as the original DeepSeek V3 project.

## Resources

- [Original DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437)
- [Zig Language Documentation](https://ziglang.org/documentation/master/)
- [Zig Performance Guide](https://github.com/ziglang/zig/wiki/Performance)
- [SIMD in Zig](https://ziglang.org/documentation/master/#Vectors)

## Is This Ready for Production?

**No**: this is a research/development foundation. But the groundwork is real, and it **compiles and runs**:

- **What works now**: ✅ Compiles with Zig 0.15.0-dev, tensor math, SIMD operations, benchmarks, backend architecture
- **What's missing**: The actual DeepSeek V3 model implementation
- **Timeline**: The foundation **compiles cleanly**; the model implementation is the next major milestone

## Comparison to Other Projects

| Project | Language | Status | Focus |
|---------|----------|--------|-------|
| **This** | Zig | Foundation + API | Web-first inference |
| llama.cpp | C++ | Production | CLI/library |
| Candle | Rust | Production | ML framework |
| ZML | Zig | Research | Low-level ML ops |

**Unique advantages**: Built-in web server, Zig's zero-cost abstractions, single binary deployment.

---

**⚡ Built with Zig for blazing fast LLM inference!**