# DeepZig V3 Implementation - Setup Guide
This guide will help you set up the development environment and understand the project structure.

## Prerequisites

### 1. Install Zig 0.15.0-dev

Download the latest development build from [ziglang.org/download](https://ziglang.org/download/):

```bash
# macOS (using Homebrew)
brew install zig --HEAD

# Linux (manual installation)
wget https://ziglang.org/builds/zig-linux-x86_64-0.15.0-dev.xxx.tar.xz
tar -xf zig-linux-x86_64-0.15.0-dev.xxx.tar.xz
export PATH=$PATH:/path/to/zig

# Verify installation
zig version
# Should show: 0.15.0-dev.xxx
```

### 2. Platform-Specific Setup

#### macOS (for Metal backend)

```bash
# Install Xcode Command Line Tools
xcode-select --install

# Verify Metal support
system_profiler SPDisplaysDataType | grep Metal
```

#### Linux (for CUDA backend, optional)

```bash
# Install the CUDA Toolkit (optional)
# Follow: https://developer.nvidia.com/cuda-downloads

# For Ubuntu/Debian:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
```

## Project Overview

### Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Web Layer     │    │   Core Engine    │    │    Backends     │
│                 │    │                  │    │                 │
│ ├─ HTTP API     │◄──►│ ├─ Transformer   │◄──►│ ├─ CPU (SIMD)   │
│ ├─ WebSocket    │    │ ├─ Attention     │    │ ├─ Metal (macOS)│
│ ├─ Rate Limit   │    │ ├─ MoE Routing   │    │ ├─ CUDA (Linux) │
│ └─ Auth         │    │ └─ Tokenizer     │    │ └─ WebGPU       │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

### Key Components

#### Core Module (`src/core/`)

- **Tensor Operations**: SIMD-optimized tensor math with AVX2/NEON support
- **Model Architecture**: DeepSeek V3 implementation with MLA and MoE
- **Memory Management**: Arena allocators and memory pools
- **Backend Abstraction**: Unified interface for CPU/GPU computation (see the sketch below)

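The actual interface lives in `src/backends/`; the snippet below is only a sketch of what a type-erased, vtable-style backend can look like in Zig, in the spirit of `std.mem.Allocator`. The `Backend`, `VTable`, and `CpuBackend` names are illustrative, not the project's real API:

```zig
/// Illustrative sketch: a type-erased backend interface (ptr + vtable),
/// similar in spirit to std.mem.Allocator.
pub const Backend = struct {
    ptr: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        /// c = a * b for row-major f32 matrices of shape (m x k) and (k x n).
        matmul: *const fn (ptr: *anyopaque, a: []const f32, b: []const f32, c: []f32, m: usize, k: usize, n: usize) void,
    };

    pub fn matmul(self: Backend, a: []const f32, b: []const f32, c: []f32, m: usize, k: usize, n: usize) void {
        self.vtable.matmul(self.ptr, a, b, c, m, k, n);
    }
};

/// A naive CPU implementation that satisfies the interface.
pub const CpuBackend = struct {
    pub fn backend(self: *CpuBackend) Backend {
        return .{ .ptr = self, .vtable = &.{ .matmul = matmulImpl } };
    }

    fn matmulImpl(_: *anyopaque, a: []const f32, b: []const f32, c: []f32, m: usize, k: usize, n: usize) void {
        for (0..m) |i| {
            for (0..n) |j| {
                var sum: f32 = 0;
                for (0..k) |p| sum += a[i * k + p] * b[p * n + j];
                c[i * n + j] = sum;
            }
        }
    }
};
```

A concrete backend hands out its `Backend` via something like `cpu.backend()`, and the core engine calls `backend.matmul(...)` without knowing which device runs it.
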
#### Web Layer (`src/web/`)

- **HTTP Server**: Built on `std.http.Server` (Zig 0.15.0 compatible)
- **OpenAI API**: Compatible `/v1/chat/completions` endpoint (request shape sketched below)
- **Middleware**: CORS, authentication, rate limiting
- **WebSocket**: Streaming inference support (planned)

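Purely as an illustration of the OpenAI-compatible request shape (the same body used in the `curl` example under "Testing the API" below), here is a sketch of parsing it with `std.json`; the `ChatRequest` and `Message` struct names are hypothetical, not the project's actual types:

```zig
const std = @import("std");

// Hypothetical request shape for POST /v1/chat/completions.
const Message = struct {
    role: []const u8,
    content: []const u8,
};

const ChatRequest = struct {
    model: []const u8,
    messages: []Message,
    max_tokens: ?u32 = null,
};

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    const body =
        \\{"model":"deepseek-v3","messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}
    ;

    const parsed = try std.json.parseFromSlice(ChatRequest, allocator, body, .{
        .ignore_unknown_fields = true,
    });
    defer parsed.deinit();

    std.log.info("model={s}, first message: {s}", .{ parsed.value.model, parsed.value.messages[0].content });
}
```
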
#### Backends (`src/backends/`)

- **CPU**: Multi-threaded with SIMD optimizations
- **Metal**: Apple Silicon GPU acceleration (macOS)
- **CUDA**: NVIDIA GPU support with Tensor Cores (Linux/Windows)

## Development Workflow

### 1. Initial Setup

```bash
# Clone the repository, then enter the experimental implementation
cd experimental/

# Build the project
zig build

# Run tests to verify setup
zig build test

# Run benchmarks
zig build bench
```

### 2. Development Commands

```bash
# Format code
zig fmt src/

# Run tests (watch mode is still in development)
zig build test

# Build optimized release
zig build -Doptimize=ReleaseFast

# Cross-compile for different targets
zig build -Dtarget=aarch64-macos        # Apple Silicon
zig build -Dtarget=x86_64-linux         # Linux x64
zig build -Dtarget=wasm32-freestanding  # WebAssembly
```

### 3. Running the Server

```bash
# Default configuration (CPU backend, port 8080)
zig build run

# Custom configuration
zig build run -- --port 3000 --backend metal

# With model path (when implemented)
zig build run -- --model ./models/deepseek-v3.bin --backend cuda
```

### 4. Testing the API

```bash
# Health check
curl http://localhost:8080/health

# Model information
curl http://localhost:8080/v1/models

# Chat completion (placeholder response)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
```

## Implementation Status

### ✅ Ready for Development

- [x] Build system and project structure
- [x] Core tensor operations with SIMD
- [x] HTTP server with basic routing
- [x] OpenAI API compatibility layer
- [x] Memory management utilities
- [x] Benchmark framework

### 🚧 Needs Implementation

- [ ] **DeepSeek V3 Model**: Transformer architecture
- [ ] **Attention Mechanism**: Multi-Head Latent Attention (MLA)
- [ ] **MoE Implementation**: Expert routing and selection (see the sketch after this list)
- [ ] **Tokenizer**: BPE tokenization (currently placeholder)
- [ ] **Model Loading**: Weight file parsing and loading
- [ ] **GPU Backends**: Metal and CUDA kernel implementations

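None of the above exists yet. Purely as a flavour of what the expert-routing item involves, here is a small, illustrative sketch of top-k expert selection over per-token router logits; the function name, shapes, and the 256-expert cap are made up, and real routing would also normalize the selected scores and load-balance across experts:

```zig
const std = @import("std");

/// Illustrative only: select the `top_k` experts with the highest router
/// logits for a single token. A real MoE router would also softmax the
/// selected scores and dispatch the token's activations to those experts.
fn topKExperts(logits: []const f32, indices: []usize, top_k: usize) void {
    std.debug.assert(logits.len <= 256 and top_k <= logits.len and indices.len >= top_k);
    var chosen = [_]bool{false} ** 256;
    for (0..top_k) |slot| {
        var best: usize = 0;
        var best_score = -std.math.inf(f32);
        for (logits, 0..) |score, expert| {
            if (!chosen[expert] and score > best_score) {
                best = expert;
                best_score = score;
            }
        }
        chosen[best] = true;
        indices[slot] = best;
    }
}

test "top-2 routing picks the two largest logits" {
    const logits = [_]f32{ 0.1, 2.0, -1.0, 0.7 };
    var indices: [2]usize = undefined;
    topKExperts(&logits, &indices, 2);
    try std.testing.expectEqual(@as(usize, 1), indices[0]);
    try std.testing.expectEqual(@as(usize, 3), indices[1]);
}
```
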
### 📋 Future Enhancements

- [ ] Model quantization (INT8, FP16)
- [ ] Flash Attention optimization
- [ ] WebSocket streaming
- [ ] Distributed inference
- [ ] Model sharding

## Code Style and Conventions

### Zig Best Practices

- Use `camelCase` for functions and `snake_case` for variables
- Use `PascalCase` for types and structs
- Prefer explicit error handling with `!` and `catch`
- Use arena allocators for request-scoped memory
- Leverage comptime for zero-cost abstractions (see the sketch below)

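A small, made-up example of these conventions: `FixedVector` returns a specialized type at comptime, so each `(T, size)` pair compiles to its own zero-overhead struct. None of these names come from the repository:

```zig
const std = @import("std");

/// PascalCase for types; the comptime parameters make this a zero-cost
/// abstraction: FixedVector(f32, 4) is its own concrete type.
fn FixedVector(comptime T: type, comptime size: usize) type {
    return struct {
        data: [size]T = [_]T{0} ** size,

        // camelCase for functions, snake_case for variables.
        pub fn dotProduct(self: @This(), other: @This()) T {
            var total: T = 0;
            for (self.data, other.data) |a, b| total += a * b;
            return total;
        }
    };
}

test "comptime-specialized vector" {
    var lhs = FixedVector(f32, 4){};
    var rhs = FixedVector(f32, 4){};
    lhs.data = .{ 1, 2, 3, 4 };
    rhs.data = .{ 1, 1, 1, 1 };
    try std.testing.expectEqual(@as(f32, 10), lhs.dotProduct(rhs));
}
```
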
### Error Handling

```zig
// Preferred: explicit error handling
const result = someFunction() catch |err| switch (err) {
    error.OutOfMemory => return err,
    error.InvalidInput => {
        std.log.err("Invalid input provided", .{});
        return err;
    },
    else => unreachable,
};

// Use defer for cleanup
var tensor = try Tensor.init(allocator, shape, .f32);
defer tensor.deinit();
```

### Memory Management

```zig
// Use arena allocators for request scope
var arena = std.heap.ArenaAllocator.init(allocator);
defer arena.deinit();
const request_allocator = arena.allocator();

// Use memory pools for tensors
var tensor_pool = TensorPool.init(allocator);
defer tensor_pool.deinit();
```

## Performance Considerations

### SIMD Optimization

- Use `@Vector` types for SIMD operations (see the sketch below)
- Align data to cache-line boundaries (64 bytes)
- Prefer blocked algorithms for better cache locality

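As an illustration of the first point (not the project's tensor code), element-wise addition over 8-lane `f32` vectors with a scalar tail loop:

```zig
const std = @import("std");

/// Illustrative SIMD add: process 8 f32 lanes at a time with @Vector,
/// then handle any remaining tail elements with a scalar loop.
fn vectorAdd(a: []const f32, b: []const f32, out: []f32) void {
    std.debug.assert(a.len == b.len and a.len == out.len);
    const Lane = @Vector(8, f32);
    var i: usize = 0;
    while (i + 8 <= a.len) : (i += 8) {
        const va: Lane = a[i..][0..8].*;
        const vb: Lane = b[i..][0..8].*;
        out[i..][0..8].* = va + vb;
    }
    while (i < a.len) : (i += 1) out[i] = a[i] + b[i];
}

test "simd add matches scalar add" {
    const a = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    const b = [_]f32{ 9, 8, 7, 6, 5, 4, 3, 2, 1 };
    var out: [9]f32 = undefined;
    vectorAdd(&a, &b, &out);
    for (out) |v| try std.testing.expectEqual(@as(f32, 10), v);
}
```
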
### Backend Selection

- **CPU**: Best for smaller models and development
- **Metal**: Optimal for Apple Silicon (M1/M2/M3)
- **CUDA**: Best for NVIDIA GPUs with Tensor Cores

### Memory Layout

- Use structure-of-arrays (SoA) for better vectorization (see the sketch below)
- Minimize memory allocations in hot paths
- Leverage unified memory on Apple Silicon

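To make the SoA point concrete, a small sketch with a made-up `Token` struct: instead of an array of structs, keep each field in its own contiguous array. `std.MultiArrayList` performs exactly this transformation:

```zig
const std = @import("std");

// Array-of-structs: id/score/bias would be interleaved in memory.
const Token = struct { id: u32, score: f32, bias: f32 };

test "std.MultiArrayList stores each field contiguously (SoA)" {
    var tokens = std.MultiArrayList(Token){};
    defer tokens.deinit(std.testing.allocator);

    try tokens.append(std.testing.allocator, .{ .id = 1, .score = 0.5, .bias = 0.0 });
    try tokens.append(std.testing.allocator, .{ .id = 2, .score = 1.5, .bias = 0.0 });

    // items(.score) is a dense []f32 slice: friendly to SIMD and the cache.
    var sum: f32 = 0;
    for (tokens.items(.score)) |s| sum += s;
    try std.testing.expectEqual(@as(f32, 2.0), sum);
}
```
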
## Debugging and Profiling

### Debug Build

```bash
# Build with debug symbols
zig build -Doptimize=Debug

# Run with verbose logging (std.log defaults to debug level in Debug builds)
zig build run
```

### Performance Profiling

```bash
# Run benchmarks
zig build bench

# Profile with system tools:
#   macOS: Instruments.app
#   Linux: perf, valgrind
#   Windows: Visual Studio Diagnostics
```

## Next Steps

1. **Choose an area to implement**:
   - Core ML components (transformer, attention, MoE)
   - Backend optimizations (Metal shaders, CUDA kernels)
   - Web features (streaming, authentication)

2. **Read the code**:
   - Start with `src/core/root.zig` for the module structure
   - Check `src/main.zig` for the server entry point
   - Look at `bench/main.zig` for performance testing

3. **Run and experiment**:
   - Build and run the server
   - Try the API endpoints
   - Run benchmarks to understand performance
   - Read the TODOs in the code for implementation ideas

4. **Contribute**:
   - Pick a TODO item
   - Implement and test it
   - Submit improvements

## Resources

- [Zig Language Reference](https://ziglang.org/documentation/master/)
- [DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437)
- [Zig SIMD Guide](https://ziglang.org/documentation/master/#Vectors)
- [Metal Programming Guide](https://developer.apple.com/metal/)
- [CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/)

---

Ready to build the future of high-performance LLM inference! 🚀