
DeepZig V3 Implementation - Setup Guide

This guide will help you set up the development environment and understand the project structure.

Prerequisites

1. Install Zig 0.15.0-dev

Download the latest development build from ziglang.org/download:

# macOS (using Homebrew)
brew install zig --HEAD

# Linux (manual installation)
wget https://ziglang.org/builds/zig-linux-x86_64-0.15.0-dev.xxx.tar.xz
tar -xf zig-linux-x86_64-0.15.0-dev.xxx.tar.xz
export PATH=$PATH:/path/to/zig

# Verify installation
zig version
# Should show: 0.15.0-dev.xxx

2. Platform-Specific Setup

macOS (for Metal backend)

# Install Xcode Command Line Tools
xcode-select --install

# Verify Metal support
system_profiler SPDisplaysDataType | grep Metal

Linux (for CUDA backend, optional)

# Install CUDA Toolkit (optional)
# Follow: https://developer.nvidia.com/cuda-downloads

# For Ubuntu/Debian:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

Project Overview

Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Web Layer     │    │   Core Engine    │    │   Backends      │
│                 │    │                  │    │                 │
│ ├─ HTTP API     │◄──►│ ├─ Transformer   │◄──►│ ├─ CPU (SIMD)   │
│ ├─ WebSocket    │    │ ├─ Attention     │    │ ├─ Metal (macOS)│
│ ├─ Rate Limit   │    │ ├─ MoE Routing   │    │ ├─ CUDA (Linux) │
│ └─ Auth         │    │ └─ Tokenizer     │    │ └─ WebGPU       │
└─────────────────┘    └──────────────────┘    └─────────────────┘

Key Components

Core Module (src/core/)

  • Tensor Operations: SIMD-optimized tensor math with AVX2/NEON support (see the sketch after this list)
  • Model Architecture: DeepSeek V3 implementation with MLA and MoE
  • Memory Management: Arena allocators and memory pools
  • Backend Abstraction: Unified interface for CPU/GPU computation
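
To give a feel for the CPU SIMD path, here is a minimal, self-contained sketch of an elementwise add built on Zig's @Vector. The function name and layout are illustrative, not the repo's actual tensor API:

const std = @import("std");

// Illustrative only: elementwise add over @Vector lanes, with a scalar tail.
fn vecAdd(dst: []f32, a: []const f32, b: []const f32) void {
    const lanes = 8; // 8 x f32 = 256 bits, one AVX2 register
    const V = @Vector(lanes, f32);
    var i: usize = 0;
    while (i + lanes <= dst.len) : (i += lanes) {
        const va: V = a[i..][0..lanes].*; // array -> vector coercion
        const vb: V = b[i..][0..lanes].*;
        dst[i..][0..lanes].* = va + vb; // one SIMD add per iteration
    }
    while (i < dst.len) : (i += 1) dst[i] = a[i] + b[i]; // leftover elements
}

test "vecAdd" {
    const a = [_]f32{1} ** 10;
    const b = [_]f32{2} ** 10;
    var out: [10]f32 = undefined;
    vecAdd(&out, &a, &b);
    try std.testing.expectEqual(@as(f32, 3), out[9]);
}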

Web Layer (src/web/)

  • HTTP Server: Built on std.http.Server (Zig 0.15.0 compatible)
  • OpenAI API: Compatible /v1/chat/completions endpoint (request-shape sketch after this list)
  • Middleware: CORS, authentication, rate limiting
  • WebSocket: Streaming inference support (planned)
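
To make the request shape concrete, here is a hedged sketch of parsing an OpenAI-style chat request with std.json; the struct layout is an assumption for illustration, not the repo's actual definition:

const std = @import("std");

// Hypothetical request shape for /v1/chat/completions.
const ChatRequest = struct {
    model: []const u8,
    messages: []const Message,
    max_tokens: ?u32 = null, // optional field with a default

    const Message = struct { role: []const u8, content: []const u8 };
};

test "parse chat request" {
    const body =
        \\{"model":"deepseek-v3","messages":[{"role":"user","content":"Hello!"}]}
    ;
    const parsed = try std.json.parseFromSlice(ChatRequest, std.testing.allocator, body, .{});
    defer parsed.deinit();
    try std.testing.expectEqualStrings("deepseek-v3", parsed.value.model);
}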

Backends (src/backends/)

  • CPU: Multi-threaded with SIMD optimizations
  • Metal: Apple Silicon GPU acceleration (macOS)
  • CUDA: NVIDIA GPU support with Tensor Cores (Linux/Windows)

Development Workflow

1. Initial Setup

# From the repository root, enter the experimental tree
cd experimental/

# Build the project
zig build

# Run tests to verify setup
zig build test

# Run benchmarks
zig build bench

2. Development Commands

# Format code
zig fmt src/

# Run tests (recent dev builds can re-run on file changes with --watch)
zig build test

# Build optimized release
zig build -Doptimize=ReleaseFast

# Cross-compile for different targets
zig build -Dtarget=aarch64-macos    # Apple Silicon
zig build -Dtarget=x86_64-linux     # Linux x64
zig build -Dtarget=wasm32-freestanding # WebAssembly

3. Running the Server

# Default configuration (CPU backend, port 8080)
zig build run

# Custom configuration
zig build run -- --port 3000 --backend metal

# With model path (when implemented)
zig build run -- --model ./models/deepseek-v3.bin --backend cuda

4. Testing the API

# Health check
curl http://localhost:8080/health

# Model information
curl http://localhost:8080/v1/models

# Chat completion (placeholder response)
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'
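
Since the endpoint mirrors the OpenAI schema, the reply should look roughly like the following (field values are illustrative placeholders):

{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "deepseek-v3",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "..."},
    "finish_reason": "stop"
  }]
}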

Implementation Status

✅ Ready for Development

  • Build system and project structure
  • Core tensor operations with SIMD
  • HTTP server with basic routing
  • OpenAI API compatibility layer
  • Memory management utilities
  • Benchmark framework

🚧 Needs Implementation

  • DeepSeek V3 Model: Transformer architecture
  • Attention Mechanism: Multi-Head Latent Attention (MLA)
  • MoE Implementation: Expert routing and selection
  • Tokenizer: BPE tokenization (currently placeholder)
  • Model Loading: Weight file parsing and loading
  • GPU Backends: Metal and CUDA kernel implementations

📋 Future Enhancements

  • Model quantization (INT8, FP16)
  • Flash Attention optimization
  • WebSocket streaming
  • Distributed inference
  • Model sharding

Code Style and Conventions

Zig Best Practices

  • Use snake_case for functions and variables
  • Use PascalCase for types and structs
  • Prefer explicit error handling with ! and catch
  • Use arena allocators for request-scoped memory
  • Leverage comptime for zero-cost abstractions (example below)
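
As a concrete example of the comptime point above, a type constructor is resolved entirely at compile time; this generic matrix is an illustration, not the project's actual tensor type:

// Illustrative comptime generic: each instantiation becomes a distinct
// concrete type at compile time, with no runtime dispatch.
fn Matrix(comptime T: type, comptime rows: usize, comptime cols: usize) type {
    return struct {
        data: [rows * cols]T,

        pub fn at(self: @This(), r: usize, c: usize) T {
            return self.data[r * cols + c];
        }
    };
}

const Mat4 = Matrix(f32, 4, 4); // a concrete type, zero runtime overhead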

Error Handling

// Preferred: explicit error handling
const result = someFunction() catch |err| switch (err) {
    error.OutOfMemory => return err,
    error.InvalidInput => {
        std.log.err("Invalid input provided", .{});
        return err;
    },
    else => unreachable,
};

// Use defer for cleanup
var tensor = try Tensor.init(allocator, shape, .f32);
defer tensor.deinit();

Memory Management

// Use arena allocators for request scope
var arena = std.heap.ArenaAllocator.init(allocator);
defer arena.deinit();
const request_allocator = arena.allocator();

// Use memory pools for tensors
var tensor_pool = TensorPool.init(allocator);
defer tensor_pool.deinit();

Performance Considerations

SIMD Optimization

  • Use @Vector types for SIMD operations
  • Align data to cache line boundaries (64 bytes)
  • Prefer blocked algorithms for better cache locality (sketch below)
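
A blocked loop keeps each tile of the operands cache-resident; here is a schematic sketch for row-major f32 matrices (names and tile size are illustrative, and c is assumed pre-zeroed):

// Illustrative tiling: work on BxB sub-blocks so each tile stays in cache.
fn matmulBlocked(c: []f32, a: []const f32, b: []const f32, n: usize) void {
    const B = 64; // tile edge; tune per cache size
    var ii: usize = 0;
    while (ii < n) : (ii += B) {
        var kk: usize = 0;
        while (kk < n) : (kk += B) {
            var jj: usize = 0;
            while (jj < n) : (jj += B) {
                var i = ii;
                while (i < @min(ii + B, n)) : (i += 1) {
                    var k = kk;
                    while (k < @min(kk + B, n)) : (k += 1) {
                        const aik = a[i * n + k];
                        var j = jj;
                        while (j < @min(jj + B, n)) : (j += 1) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
}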

Backend Selection

  • CPU: Best for smaller models and local development
  • Metal: Optimal for Apple Silicon (M1/M2/M3)
  • CUDA: Best for NVIDIA GPUs with Tensor Cores
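
One way to express this policy is a small dispatch helper; the enum and function below are a hypothetical sketch, not the repo's actual backend API:

const builtin = @import("builtin");

const Backend = enum { cpu, metal, cuda };

// Illustrative default: honor an explicit request, otherwise pick per-OS.
fn defaultBackend(requested: ?Backend) Backend {
    if (requested) |b| return b;
    return switch (builtin.os.tag) {
        .macos => .metal, // Apple Silicon GPU path
        .linux => .cpu, // CUDA only when explicitly requested
        else => .cpu,
    };
}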

Memory Layout

  • Use structure-of-arrays (SoA) for better vectorization
  • Minimize memory allocations in hot paths
  • Leverage unified memory on Apple Silicon
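
The AoS/SoA difference is easiest to see side by side; these types are purely illustrative:

// Array-of-structs: fields interleaved, awkward to vectorize.
const TokenAoS = struct { id: u32, logit: f32 };
// tokens: []TokenAoS — consecutive logits sit 8 bytes apart.

// Struct-of-arrays: each field contiguous, ideal for @Vector loads.
const TokensSoA = struct {
    ids: []u32,
    logits: []f32, // logits[i..i+8] maps straight onto a SIMD register
};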

Debugging and Profiling

Debug Build

# Build with debug symbols
zig build -Doptimize=Debug

# Verbose logging: Zig's std.log level is configured in code rather than
# via an environment variable (see the sketch below)
zig build run
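
For reference, a Zig program raises its log level via std_options in the root source file; a minimal sketch (the exact std.Options layout may vary between 0.15-dev snapshots):

const std = @import("std");

// Declared in the root file (e.g. src/main.zig); std.log picks this up.
pub const std_options: std.Options = .{
    .log_level = .debug,
};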

Performance Profiling

# Run benchmarks
zig build bench

# Profile with system tools
# macOS: Instruments.app
# Linux: perf, valgrind
# Windows: Visual Studio Diagnostics

Next Steps

  1. Choose an area to implement:

    • Core ML components (transformer, attention, MoE)
    • Backend optimizations (Metal shaders, CUDA kernels)
    • Web features (streaming, authentication)
  2. Read the code:

    • Start with src/core/root.zig for module structure
    • Check src/main.zig for the server entry point
    • Look at bench/main.zig for performance testing
  3. Run and experiment:

    • Build and run the server
    • Try the API endpoints
    • Run benchmarks to understand performance
    • Read the TODOs in the code for implementation ideas
  4. Contribute:

    • Pick a TODO item
    • Implement and test
    • Submit improvements


Ready to build the future of high-performance LLM inference! 🚀