
DeepZig V3 Implementation - Setup Guide

This guide will help you set up the development environment and understand the project structure.

Prerequisites

1. Install Zig 0.15.0-dev

Download the latest development build from ziglang.org/download:

# macOS (using Homebrew)
brew install zig --HEAD

# Linux (manual installation)
wget https://ziglang.org/builds/zig-linux-x86_64-0.15.0-dev.xxx.tar.xz
tar -xf zig-linux-x86_64-0.15.0-dev.xxx.tar.xz
export PATH=$PATH:/path/to/zig

# Verify installation
zig version
# Should show: 0.15.0-dev.xxx

2. Platform-Specific Setup

macOS (for Metal backend)

# Install Xcode Command Line Tools
xcode-select --install

# Verify Metal support
system_profiler SPDisplaysDataType | grep Metal

Linux (for CUDA backend, optional)

# Install CUDA Toolkit (optional)
# Follow: https://developer.nvidia.com/cuda-downloads

# For Ubuntu/Debian:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

Project Overview

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Web Layer     │    │   Core Engine    │    │   Backends      │
│                 │    │                  │    │                 │
│ ├─ HTTP API     │◄──►│ ├─ Transformer   │◄──►│ ├─ CPU (SIMD)   │
│ ├─ WebSocket    │    │ ├─ Attention     │    │ ├─ Metal (macOS)│
│ ├─ Rate Limit   │    │ ├─ MoE Routing   │    │ ├─ CUDA (Linux) │
│ └─ Auth         │    │ └─ Tokenizer     │    │ └─ WebGPU       │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Key Components

Core Module (src/core/)

  • Tensor Operations: SIMD-optimized tensor math with AVX2/NEON support (see the sketch after this list)
  • Model Architecture: DeepSeek V3 implementation with MLA and MoE
  • Memory Management: Arena allocators and memory pools
  • Backend Abstraction: Unified interface for CPU/GPU computation
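
To give a feel for the CPU SIMD path, here is a minimal, self-contained sketch of an elementwise add built on Zig's @Vector. The function name and layout are illustrative, not the repo's actual tensor API:

const std = @import("std");

// Illustrative only: elementwise add over @Vector lanes, with a scalar tail.
fn vecAdd(dst: []f32, a: []const f32, b: []const f32) void {
    const lanes = 8; // 8 x f32 = 256 bits, one AVX2 register
    const V = @Vector(lanes, f32);
    var i: usize = 0;
    while (i + lanes <= dst.len) : (i += lanes) {
        const va: V = a[i..][0..lanes].*; // array -> vector coercion
        const vb: V = b[i..][0..lanes].*;
        dst[i..][0..lanes].* = va + vb; // one SIMD add per iteration
    }
    while (i < dst.len) : (i += 1) dst[i] = a[i] + b[i]; // leftover elements
}

test "vecAdd" {
    const a = [_]f32{1} ** 10;
    const b = [_]f32{2} ** 10;
    var out: [10]f32 = undefined;
    vecAdd(&out, &a, &b);
    try std.testing.expectEqual(@as(f32, 3), out[9]);
}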

Web Layer (src/web/)

  • HTTP Server: Built on std.http.Server (Zig 0.15.0 compatible)
  • OpenAI API: Compatible /v1/chat/completions endpoint (request-shape sketch after this list)
  • Middleware: CORS, authentication, rate limiting
  • WebSocket: Streaming inference support (planned)
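
To make the request shape concrete, here is a hedged sketch of parsing an OpenAI-style chat request with std.json; the struct layout is an assumption for illustration, not the repo's actual definition:

const std = @import("std");

// Hypothetical request shape for /v1/chat/completions.
const ChatRequest = struct {
    model: []const u8,
    messages: []const Message,
    max_tokens: ?u32 = null, // optional field with a default

    const Message = struct { role: []const u8, content: []const u8 };
};

test "parse chat request" {
    const body =
        \\{"model":"deepseek-v3","messages":[{"role":"user","content":"Hello!"}]}
    ;
    const parsed = try std.json.parseFromSlice(ChatRequest, std.testing.allocator, body, .{});
    defer parsed.deinit();
    try std.testing.expectEqualStrings("deepseek-v3", parsed.value.model);
}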

Backends (src/backends/)

  • CPU: Multi-threaded with SIMD optimizations
  • Metal: Apple Silicon GPU acceleration (macOS)
  • CUDA: NVIDIA GPU support with Tensor Cores (Linux/Windows)

Development Workflow

1. Initial Setup

# From the repository root, enter the experimental tree
cd experimental/

# Build the project
zig build

# Run tests to verify setup
zig build test

# Run benchmarks
zig build bench

2. Development Commands

# Format code
zig fmt src/

# Run tests (recent dev builds can re-run on file changes with --watch)
zig build test

# Build optimized release
zig build -Doptimize=ReleaseFast

# Cross-compile for different targets
zig build -Dtarget=aarch64-macos    # Apple Silicon
zig build -Dtarget=x86_64-linux     # Linux x64
zig build -Dtarget=wasm32-freestanding # WebAssembly

3. Running the Server

# Default configuration (CPU backend, port 8080)
zig build run

# Custom configuration
zig build run -- --port 3000 --backend metal

# With model path (when implemented)
zig build run -- --model ./models/deepseek-v3.bin --backend cuda

4. Testing the API

# Health check
curl http://localhost:8080/health

# Model information
curl http://localhost:8080/v1/models

# Chat completion (placeholder response)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
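
Since the endpoint mirrors the OpenAI schema, the reply should look roughly like the following (field values are illustrative placeholders):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "deepseek-v3",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "..."},
    "finish_reason": "stop"
  }]
}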

Implementation Status

✅ Ready for Development

  • Build system and project structure
  • Core tensor operations with SIMD
  • HTTP server with basic routing
  • OpenAI API compatibility layer
  • Memory management utilities
  • Benchmark framework

🚧 Needs Implementation

  • DeepSeek V3 Model: Transformer architecture
  • Attention Mechanism: Multi-Head Latent Attention (MLA)
  • MoE Implementation: Expert routing and selection
  • Tokenizer: BPE tokenization (currently placeholder)
  • Model Loading: Weight file parsing and loading
  • GPU Backends: Metal and CUDA kernel implementations

📋 Future Enhancements

  • Model quantization (INT8, FP16)
  • Flash Attention optimization
  • WebSocket streaming
  • Distributed inference
  • Model sharding

Code Style and Conventions

Zig Best Practices

  • Use snake_case for functions and variables
  • Use PascalCase for types and structs
  • Prefer explicit error handling with ! and catch
  • Use arena allocators for request-scoped memory
  • Leverage comptime for zero-cost abstractions (example below)
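
As a concrete example of the comptime point above, a type constructor is resolved entirely at compile time; this generic matrix is an illustration, not the project's actual tensor type:

// Illustrative comptime generic: each instantiation becomes a distinct
// concrete type at compile time, with no runtime dispatch.
fn Matrix(comptime T: type, comptime rows: usize, comptime cols: usize) type {
    return struct {
        data: [rows * cols]T,

        pub fn at(self: @This(), r: usize, c: usize) T {
            return self.data[r * cols + c];
        }
    };
}

const Mat4 = Matrix(f32, 4, 4); // a concrete type, zero runtime overhead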

Error Handling

// Preferred: explicit error handling
const result = someFunction() catch |err| switch (err) {
    error.OutOfMemory => return err,
    error.InvalidInput => {
        std.log.err("Invalid input provided", .{});
        return err;
    },
    else => unreachable,
};

// Use defer for cleanup
var tensor = try Tensor.init(allocator, shape, .f32);
defer tensor.deinit();

Memory Management

// Use arena allocators for request scope
var arena = std.heap.ArenaAllocator.init(allocator);
defer arena.deinit();
const request_allocator = arena.allocator();

// Use memory pools for tensors
var tensor_pool = TensorPool.init(allocator);
defer tensor_pool.deinit();

Performance Considerations

SIMD Optimization

  • Use @Vector types for SIMD operations
  • Align data to cache line boundaries (64 bytes)
  • Prefer blocked algorithms for better cache locality (sketch below)
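
A blocked loop keeps each tile of the operands cache-resident; here is a schematic sketch for row-major f32 matrices (names and tile size are illustrative, and c is assumed pre-zeroed):

// Illustrative tiling: work on BxB sub-blocks so each tile stays in cache.
fn matmulBlocked(c: []f32, a: []const f32, b: []const f32, n: usize) void {
    const B = 64; // tile edge; tune per cache size
    var ii: usize = 0;
    while (ii < n) : (ii += B) {
        var kk: usize = 0;
        while (kk < n) : (kk += B) {
            var jj: usize = 0;
            while (jj < n) : (jj += B) {
                var i = ii;
                while (i < @min(ii + B, n)) : (i += 1) {
                    var k = kk;
                    while (k < @min(kk + B, n)) : (k += 1) {
                        const aik = a[i * n + k];
                        var j = jj;
                        while (j < @min(jj + B, n)) : (j += 1) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}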

Backend Selection

  • CPU: Best for smaller models and local development
  • Metal: Optimal for Apple Silicon (M1/M2/M3)
  • CUDA: Best for NVIDIA GPUs with Tensor Cores
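
One way to express this policy is a small dispatch helper; the enum and function below are a hypothetical sketch, not the repo's actual backend API:

const builtin = @import("builtin");

const Backend = enum { cpu, metal, cuda };

// Illustrative default: honor an explicit request, otherwise pick per-OS.
fn defaultBackend(requested: ?Backend) Backend {
    if (requested) |b| return b;
    return switch (builtin.os.tag) {
        .macos => .metal, // Apple Silicon GPU path
        .linux => .cpu, // CUDA only when explicitly requested
        else => .cpu,
    };
}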

Memory Layout

  • Use structure-of-arrays (SoA) for better vectorization
  • Minimize memory allocations in hot paths
  • Leverage unified memory on Apple Silicon
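
The AoS/SoA difference is easiest to see side by side; these types are purely illustrative:

// Array-of-structs: fields interleaved, awkward to vectorize.
const TokenAoS = struct { id: u32, logit: f32 };
// tokens: []TokenAoS — consecutive logits sit 8 bytes apart.

// Struct-of-arrays: each field contiguous, ideal for @Vector loads.
const TokensSoA = struct {
    ids: []u32,
    logits: []f32, // logits[i..i+8] maps straight onto a SIMD register
};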

Debugging and Profiling

Debug Build

# Build with debug symbols
zig build -Doptimize=Debug

# Verbose logging: Zig's std.log level is configured in code rather than
# via an environment variable (see the sketch below)
zig build run
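
For reference, a Zig program raises its log level via std_options in the root source file; a minimal sketch (the exact std.Options layout may vary between 0.15-dev snapshots):

const std = @import("std");

// Declared in the root file (e.g. src/main.zig); std.log picks this up.
pub const std_options: std.Options = .{
    .log_level = .debug,
};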

Performance Profiling

# Run benchmarks
zig build bench

# Profile with system tools
# macOS: Instruments.app
# Linux: perf, valgrind
# Windows: Visual Studio Diagnostics

Next Steps

  1. Choose an area to implement:

    • Core ML components (transformer, attention, MoE)
    • Backend optimizations (Metal shaders, CUDA kernels)
    • Web features (streaming, authentication)
  2. Read the code:

    • Start with src/core/root.zig for module structure
    • Check src/main.zig for the server entry point
    • Look at bench/main.zig for performance testing
  3. Run and experiment:

    • Build and run the server
    • Try the API endpoints
    • Run benchmarks to understand performance
    • Read the TODOs in the code for implementation ideas
  4. Contribute:

    • Pick a TODO item
    • Implement and test
    • Submit improvements


Ready to build the future of high-performance LLM inference! 🚀