# DeepZig V3 Implementation - Setup Guide
This guide will help you set up the development environment and understand the project structure.
## Prerequisites
### 1. Install Zig 0.15.0-dev
Download the latest development build from [ziglang.org/download](https://ziglang.org/download/):
```bash
# macOS (using Homebrew)
brew install zig --HEAD
# Linux (manual installation)
wget https://ziglang.org/builds/zig-linux-x86_64-0.15.0-dev.xxx.tar.xz
tar -xf zig-linux-x86_64-0.15.0-dev.xxx.tar.xz
export PATH=$PATH:/path/to/zig
# Verify installation
zig version
# Should show: 0.15.0-dev.xxx
```
### 2. Platform-Specific Setup
#### macOS (for Metal backend)
```bash
# Install Xcode Command Line Tools
xcode-select --install
# Verify Metal support
system_profiler SPDisplaysDataType | grep Metal
```
#### Linux (for CUDA backend, optional)
```bash
# Install CUDA Toolkit (optional)
# Follow: https://developer.nvidia.com/cuda-downloads
# For Ubuntu/Debian:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
```
## Project Overview
### Architecture
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│    Web Layer    │    │   Core Engine    │    │    Backends     │
│                 │    │                  │    │                 │
│ ├─ HTTP API     │◄──►│ ├─ Transformer   │◄──►│ ├─ CPU (SIMD)   │
│ ├─ WebSocket    │    │ ├─ Attention     │    │ ├─ Metal (macOS)│
│ ├─ Rate Limit   │    │ ├─ MoE Routing   │    │ ├─ CUDA (Linux) │
│ └─ Auth         │    │ └─ Tokenizer     │    │ └─ WebGPU       │
```
### Key Components
#### Core Module (`src/core/`)
- **Tensor Operations**: SIMD-optimized tensor math with AVX2/NEON support
- **Model Architecture**: DeepSeek V3 implementation with MLA and MoE
- **Memory Management**: Arena allocators and memory pools
- **Backend Abstraction**: Unified interface for CPU/GPU computation (see the sketch below)
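As a rough sketch of that abstraction (the `Backend`, `VTable`, and `matmul` names here are illustrative, not the actual API in `src/backends/`), a unified interface in Zig is commonly a vtable over an opaque pointer, the same pattern `std.mem.Allocator` uses:

```zig
/// Hypothetical sketch of a unified backend interface; the real code in
/// src/backends/ may differ. Each backend registers function pointers
/// behind an opaque pointer, so callers stay device-agnostic.
pub const Backend = struct {
    ptr: *anyopaque,
    vtable: *const VTable,

    pub const VTable = struct {
        matmul: *const fn (ptr: *anyopaque, a: []const f32, b: []const f32, out: []f32) void,
        deinit: *const fn (ptr: *anyopaque) void,
    };

    pub fn matmul(self: Backend, a: []const f32, b: []const f32, out: []f32) void {
        self.vtable.matmul(self.ptr, a, b, out);
    }

    pub fn deinit(self: Backend) void {
        self.vtable.deinit(self.ptr);
    }
};
```

Each backend fills in the vtable once; swapping CPU for Metal or CUDA then requires no changes at call sites.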
#### Web Layer (`src/web/`)
- **HTTP Server**: Built on `std.http.Server` (Zig 0.15.0 compatible)
- **OpenAI API**: Compatible `/v1/chat/completions` endpoint
- **Middleware**: CORS, authentication, rate limiting
- **WebSocket**: Streaming inference support (planned)
#### Backends (`src/backends/`)
- **CPU**: Multi-threaded with SIMD optimizations
- **Metal**: Apple Silicon GPU acceleration (macOS)
- **CUDA**: NVIDIA GPU support with Tensor Cores (Linux/Windows)
## Development Workflow
### 1. Initial Setup
```bash
# After cloning the repository, enter the experimental tree
cd experimental/
# Build the project
zig build
# Run tests to verify setup
zig build test
# Run benchmarks
zig build bench
```
### 2. Development Commands
```bash
# Format code
zig fmt src/
# Run tests (watch mode is still in development)
zig build test
# Build optimized release
zig build -Doptimize=ReleaseFast
# Cross-compile for different targets
zig build -Dtarget=aarch64-macos # Apple Silicon
zig build -Dtarget=x86_64-linux # Linux x64
zig build -Dtarget=wasm32-freestanding # WebAssembly
```
### 3. Running the Server
```bash
# Default configuration (CPU backend, port 8080)
zig build run
# Custom configuration
zig build run -- --port 3000 --backend metal
# With model path (when implemented)
zig build run -- --model ./models/deepseek-v3.bin --backend cuda
```
### 4. Testing the API
```bash
# Health check
curl http://localhost:8080/health
# Model information
curl http://localhost:8080/v1/models
# Chat completion (placeholder response)
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-v3",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100
}'
```
## Implementation Status
### ✅ Ready for Development
- [x] Build system and project structure
- [x] Core tensor operations with SIMD
- [x] HTTP server with basic routing
- [x] OpenAI API compatibility layer
- [x] Memory management utilities
- [x] Benchmark framework
### 🚧 Needs Implementation
- [ ] **DeepSeek V3 Model**: Transformer architecture
- [ ] **Attention Mechanism**: Multi-Head Latent Attention (MLA)
- [ ] **MoE Implementation**: Expert routing and selection
- [ ] **Tokenizer**: BPE tokenization (currently placeholder)
- [ ] **Model Loading**: Weight file parsing and loading
- [ ] **GPU Backends**: Metal and CUDA kernel implementations
### 📋 Future Enhancements
- [ ] Model quantization (INT8, FP16)
- [ ] Flash Attention optimization
- [ ] WebSocket streaming
- [ ] Distributed inference
- [ ] Model sharding
## Code Style and Conventions
### Zig Best Practices
- Use `camelCase` for functions, `snake_case` for variables
- Use `PascalCase` for types and structs
- Prefer explicit error handling with `!` and `catch`
- Use arena allocators for request-scoped memory
- Leverage comptime for zero-cost abstractions (see the sketch below)
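As a small example of the comptime point (a sketch, not project code): a `comptime` type parameter yields a generic routine that is specialized per element type at compile time, with no runtime dispatch.

```zig
const std = @import("std");

// Specialized at compile time for each T; no vtables, no branching on type.
fn dot(comptime T: type, a: []const T, b: []const T) T {
    std.debug.assert(a.len == b.len);
    var sum: T = 0;
    for (a, b) |x, y| sum += x * y;
    return sum;
}
```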
### Error Handling
```zig
// Preferred: explicit error handling
const result = someFunction() catch |err| switch (err) {
    error.OutOfMemory => return err,
    error.InvalidInput => {
        std.log.err("Invalid input provided", .{});
        return err;
    },
    else => unreachable,
};

// Use defer for cleanup
var tensor = try Tensor.init(allocator, shape, .f32);
defer tensor.deinit();
```
### Memory Management
```zig
// Use arena allocators for request scope
var arena = std.heap.ArenaAllocator.init(allocator);
defer arena.deinit();
const request_allocator = arena.allocator();

// Use memory pools for tensors
var tensor_pool = TensorPool.init(allocator);
defer tensor_pool.deinit();
```
## Performance Considerations
### SIMD Optimization
- Use `@Vector` types for SIMD operations (see the sketch below)
- Align data to cache line boundaries (64 bytes)
- Prefer blocked algorithms for better cache locality
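A minimal sketch of the first two points (assuming nothing about the project's actual kernels; `std.simd.suggestVectorLength` picks a lane count suited to the build target):

```zig
const std = @import("std");

// Lane count for the build target (e.g. 8 f32 lanes with AVX2);
// fall back to 4 when no SIMD width is suggested.
const lanes = std.simd.suggestVectorLength(f32) orelse 4;

// Cache-line-aligned storage for hot data (64-byte boundaries).
var activations: [1024]f32 align(64) = undefined;

// Arrays coerce to @Vector; `+` on vectors compiles to SIMD
// instructions where the target supports them.
fn addLanes(a: [lanes]f32, b: [lanes]f32) [lanes]f32 {
    const va: @Vector(lanes, f32) = a;
    const vb: @Vector(lanes, f32) = b;
    return va + vb;
}
```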
### Backend Selection
- **CPU**: Best for smaller models, development
- **Metal**: Optimal for Apple Silicon (M1/M2/M3)
- **CUDA**: Best for NVIDIA GPUs with Tensor Cores (see the selection sketch below)
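A compile-time default along these lines is straightforward (a hypothetical sketch; the project's actual selection logic may differ, and real dispatch would also probe for devices at runtime):

```zig
const builtin = @import("builtin");

const BackendKind = enum { cpu, metal, cuda, webgpu };

// Hypothetical per-platform default, resolved at compile time.
const default_backend: BackendKind = switch (builtin.target.os.tag) {
    .macos => .metal, // Apple Silicon GPU
    .linux => .cuda, // assumes an NVIDIA GPU is present
    else => .cpu,
};
```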
### Memory Layout
- Use structure-of-arrays (SoA) for better vectorization (illustrated below)
- Minimize memory allocations in hot paths
- Leverage unified memory on Apple Silicon
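To illustrate the SoA point (the types are invented for the example; `std.MultiArrayList` in the standard library provides this transform generically):

```zig
// Array-of-structs (AoS): id and score interleave in memory, so a
// loop over scores strides past the ids it doesn't need.
const TokenAoS = struct { id: u32, score: f32 };

// Struct-of-arrays (SoA): each field is contiguous, so loops over
// `scores` vectorize cleanly and waste no cache-line space on `ids`.
const TokensSoA = struct {
    ids: []u32,
    scores: []f32,
};
```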
## Debugging and Profiling
### Debug Build
```bash
# Build with debug symbols
zig build -Doptimize=Debug
# Verbose logging is configured at compile time (std_options.log_level
# in the root source file) rather than via an environment variable
zig build run
```
### Performance Profiling
```bash
# Run benchmarks
zig build bench
# Profile with system tools
# macOS: Instruments.app
# Linux: perf, valgrind
# Windows: Visual Studio Diagnostics
```
## Next Steps
1. **Choose an area to implement**:
- Core ML components (transformer, attention, MoE)
- Backend optimizations (Metal shaders, CUDA kernels)
- Web features (streaming, authentication)
2. **Read the code**:
- Start with `src/core/root.zig` for module structure
- Check `src/main.zig` for the server entry point
- Look at `bench/main.zig` for performance testing
3. **Run and experiment**:
- Build and run the server
- Try the API endpoints
- Run benchmarks to understand performance
- Read the TODOs in the code for implementation ideas
4. **Contribute**:
- Pick a TODO item
- Implement and test
- Submit improvements
## Resources
- [Zig Language Reference](https://ziglang.org/documentation/master/)
- [DeepSeek V3 Paper](https://arxiv.org/abs/2412.19437)
- [Zig SIMD Guide](https://ziglang.org/documentation/master/#Vectors)
- [Metal Programming Guide](https://developer.apple.com/metal/)
- [CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/)
---
Ready to build the future of high-performance LLM inference! 🚀