
DeepZig V3 Implementation 🚀

A high-performance implementation of DeepSeek V3 in Zig for blazingly fast inference.

⚠️ Status: Experimental Foundation

This project provides an experimental foundation for DeepZig V3 with a working draft implementation:

  • HTTP server with OpenAI-compatible API
  • BLAS-accelerated tensor operations (Apple Accelerate working)
  • Cross-platform build system (Zig 0.15.0-dev)
  • Memory management and backend architecture
  • Apple Silicon detection and optimization
  • Functional matrix operations (significant performance improvement)

Recent Progress: Matrix operations now use BLAS acceleration
Performance Status: 1160+ GFLOPS with the Apple Accelerate backend working (measured on an Apple M1 MacBook)

See Performance Results for detailed benchmarks.

Overview

This experimental implementation aims to leverage Zig's unique advantages for systems programming to create a high-performance LLM inference engine:

  • Zero-cost abstractions with compile-time optimization
  • Direct hardware access for SIMD and platform-specific optimizations
  • Manual memory management without garbage collection pauses
  • Single binary deployment with no runtime dependencies
  • Cross-platform compilation for multiple architectures

🚀 BLAS Acceleration Achieved! We've successfully integrated the Apple Accelerate backend, delivering 1000+ GFLOPS - roughly a 3000x speedup over the initial naive implementation (measured on an M1 MacBook).
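
For a rough idea of what reaching Accelerate from Zig can look like, here is a hedged sketch of glue code (hypothetical, not necessarily how the project's backend wires it up; it assumes libc and the Accelerate framework are linked):

// Hypothetical sketch: calling Accelerate's cblas_sgemm from Zig.
// cblas_sgemm computes C = alpha*A*B + beta*C; the integer constants are the
// standard CBLAS values (101 = RowMajor, 111 = NoTrans).
pub extern "c" fn cblas_sgemm(
    order: c_int, trans_a: c_int, trans_b: c_int,
    m: c_int, n: c_int, k: c_int,
    alpha: f32, a: [*]const f32, lda: c_int,
    b: [*]const f32, ldb: c_int,
    beta: f32, c: [*]f32, ldc: c_int,
) void;

// Row-major matmul of an (m x k) by (k x n) matrix into an (m x n) result.
pub fn matmulBlas(m: usize, n: usize, k: usize, a: []const f32, b: []const f32, c: []f32) void {
    cblas_sgemm(101, 111, 111, @intCast(m), @intCast(n), @intCast(k),
        1.0, a.ptr, @intCast(k), b.ptr, @intCast(n), 0.0, c.ptr, @intCast(n));
}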

🔗 Related: See the main project README for architecture overview and vision.

Project Structure

experimental/
├── build.zig              # Build system configuration
├── build.zig.zon          # Package dependencies  
├── src/
│   ├── main.zig           # HTTP server entry point
│   ├── core/              # Core ML components
│   │   ├── root.zig       # Module exports
│   │   ├── tensor.zig     # SIMD-optimized tensors
│   │   ├── model.zig      # DeepSeek V3 model
│   │   ├── attention.zig  # MLA attention mechanism
│   │   ├── moe.zig        # Mixture of Experts
│   │   ├── tokenizer.zig  # Text tokenization
│   │   ├── backend.zig    # Backend abstraction
│   │   ├── memory.zig     # Memory management
│   │   └── math/          # Math utilities
│   │       ├── root.zig   # Math module exports
│   │       ├── simd.zig   # SIMD operations
│   │       ├── activation.zig  # Activation functions
│   │       └── rms_norm.zig    # RMS normalization
│   ├── web/               # HTTP API layer
│   │   ├── root.zig       # Web module exports
│   │   ├── server.zig     # HTTP server (std.http)
│   │   ├── handlers.zig   # Request handlers
│   │   ├── middleware.zig # CORS, auth, rate limiting
│   │   ├── websocket.zig  # WebSocket support
│   │   ├── openai.zig     # OpenAI API compatibility
│   │   ├── request.zig    # Request wrapper
│   │   └── response.zig   # Response wrapper
│   ├── backends/          # Compute backends
│   │   ├── cpu/           # CPU with SIMD
│   │   ├── metal/         # Apple Silicon
│   │   └── cuda/          # NVIDIA GPUs
│   └── wasm/
│       └── main.zig       # WebAssembly entry point
├── bench/
│   └── main.zig           # Performance benchmarks
└── README.md               # This file

Requirements

  • Zig 0.15.0-dev
  • Platform-specific requirements:
    • macOS: Xcode Command Line Tools (for Metal backend)
    • Linux: CUDA Toolkit (for CUDA backend, optional)
    • Windows: CUDA Toolkit (for CUDA backend, optional)

Quick Start

Building

# Clone and navigate to experimental directory
cd experimental/

# Build the project
zig build

# Run the server
zig build run

# Run tests
zig build test

# Run benchmarks
zig build bench

# Build WebAssembly
zig build wasm

Running the Server

# Start server on default port (8080)
./zig-out/bin/deepseek-v3-zig

# Custom configuration
./zig-out/bin/deepseek-v3-zig --port 3000 --backend metal --model ./path/to/model

API Usage

The server exposes OpenAI-compatible endpoints:

# Chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100
  }'

# Health check
curl http://localhost:8080/health

# Model info
curl http://localhost:8080/v1/models

Performance Features

SIMD Optimizations

  • x86_64: AVX2/AVX-512 vectorization for matrix operations
  • ARM64: NEON SIMD for Apple Silicon optimization
  • Auto-vectorization: Compiler-optimized loops with @Vector types (see the sketch below)
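
As a rough illustration (a minimal sketch assuming equal-length slices, not the project's tensor code), Zig's @Vector type lets the same source lower to NEON on ARM64 and AVX on x86_64 when available:

// Minimal sketch of portable SIMD with @Vector.
fn vecAdd(a: []const f32, b: []const f32, out: []f32) void {
    const Vec = @Vector(8, f32);
    var i: usize = 0;
    // Vectorized main loop over 8-wide chunks.
    while (i + 8 <= a.len) : (i += 8) {
        const va: Vec = a[i..][0..8].*;
        const vb: Vec = b[i..][0..8].*;
        out[i..][0..8].* = va + vb;
    }
    // Scalar tail for lengths that are not a multiple of the vector width.
    while (i < a.len) : (i += 1) out[i] = a[i] + b[i];
}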

Backend Support

Backend | Status | Features
CPU | ✅ Implemented | Multi-threaded, SIMD, cache-optimized
Metal | 🚧 In Progress | Apple Silicon GPU, unified memory
CUDA | 🚧 Planned | NVIDIA GPU, Tensor Cores
WebGPU | 📋 Future | Browser GPU acceleration
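
One common Zig pattern for keeping backend dispatch zero-cost is a tagged union that resolves to a plain switch at the call site. The sketch below is hypothetical (illustrative names only, not the actual backend.zig API):

// Hypothetical backend abstraction sketch.
const CpuBackend = struct {
    pub fn matmul(self: *CpuBackend, a: []const f32, b: []const f32, c: []f32, n: usize) void {
        _ = self;
        // Naive reference kernel; a real CPU path would tile and use SIMD/BLAS.
        for (0..n) |i| {
            for (0..n) |j| {
                var sum: f32 = 0;
                for (0..n) |k| sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
        }
    }
};

const Backend = union(enum) {
    cpu: CpuBackend,
    // metal, cuda, webgpu variants would slot in here as they are implemented.

    pub fn matmul(self: *Backend, a: []const f32, b: []const f32, c: []f32, n: usize) void {
        switch (self.*) {
            inline else => |*impl| impl.matmul(a, b, c, n),
        }
    }
};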

Memory Management

  • Arena allocators for request-scoped memory (see the sketch after this list)
  • Memory pools for tensor allocations
  • Zero-copy operations where possible
  • Cache-friendly data layouts
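
A minimal sketch of the arena pattern for request-scoped memory (illustrative names, not the project's actual memory.zig API):

const std = @import("std");

// Everything allocated while handling one request is freed by a single deinit().
fn handleRequest(backing: std.mem.Allocator) !void {
    var arena = std.heap.ArenaAllocator.init(backing);
    defer arena.deinit(); // frees all request-scoped allocations at once
    const alloc = arena.allocator();

    const scratch = try alloc.alloc(f32, 4096); // no individual free needed
    @memset(scratch, 0);
}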

Development Status

✅ Drafted

  • Project structure and build system
  • Core tensor operations with SIMD
  • HTTP server with OpenAI API compatibility
  • CPU backend with optimizations
  • Memory management utilities
  • Benchmark suite

🚧 In Progress

  • DeepSeek V3 model architecture
  • Multi-Head Latent Attention (MLA)
  • Mixture of Experts (MoE) implementation
  • Metal backend for Apple Silicon
  • Model loading and weight management

📋 Planned

  • CUDA backend for NVIDIA GPUs
  • WebSocket streaming
  • Model quantization (INT8, FP16)
  • Flash Attention optimization
  • Distributed inference
  • Advanced sampling strategies

Architecture Decisions

Why Zig?

  1. Performance: Zero-cost abstractions without runtime overhead
  2. Memory Safety: Compile-time memory management without GC
  3. Simplicity: Single binary deployment, cross-compilation
  4. Control: Direct hardware access for optimization

Design Principles

  • Modularity: Clean separation between core, web, and backend layers
  • Performance: SIMD-first design with cache-friendly algorithms
  • Compatibility: OpenAI API compatibility for easy adoption
  • Extensibility: Plugin architecture for new backends

Contributing

This is an experimental project! Contributions are welcome:

  1. Core ML: Implement transformer layers, attention mechanisms
  2. Backends: Optimize CUDA/Metal compute kernels
  3. Performance: Profile and optimize bottlenecks
  4. Testing: Add comprehensive test coverage
  5. Documentation: Improve setup and usage guides

Development Setup

# Install Zig 0.15.0-dev
# https://ziglang.org/download/

# Clone repository
git clone [repository-url]
cd experimental/

# Run tests during development
zig build test --watch

# Format code
zig fmt src/

Benchmarks

Run benchmarks to measure performance:

zig build bench

Hardware Context: Benchmarks run on Apple M1 MacBook Pro (MacBookPro17,1) with 16GB unified memory, Zig 0.15.0-dev.703+597dd328e, debug build.
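
Matrix-multiply GFLOPS figures like the ones below are conventionally derived from the 2·N³ operation count of an N×N multiply; a small illustrative helper (an assumption about the method, not the benchmark's actual code):

// Illustrative only: derive GFLOPS from a timed N x N matrix multiply.
// An N x N x N multiply performs roughly 2 * N^3 floating-point operations.
fn gflops(n: usize, elapsed_ns: u64) f64 {
    const nf = @as(f64, @floatFromInt(n));
    const flops = 2.0 * nf * nf * nf;                        // 2 * N^3
    const secs = @as(f64, @floatFromInt(elapsed_ns)) / 1e9;  // ns -> s
    return flops / secs / 1e9;                               // FLOPS -> GFLOPS
}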

Example output:

🚀 DeepZig V3 Performance Benchmarks
==========================================

🎯 DYNAMIC BENCHMARK SUMMARY
===============================

📊 Matrix Multiplication Performance:
  • 256×256: 0.0 ms, 937 GFLOPS
  • 512×512: 0.2 ms, 1084 GFLOPS  
  • 1024×1024: 2.1 ms, 1164 GFLOPS
  • 2048×2048: 20.9 ms, 823 GFLOPS
  🏆 Peak measured: 1164 GFLOPS at 1024×1024

🧮 BLAS Configuration:
  • Backend: Apple Accelerate
  • Theoretical peak: 2600 GFLOPS (estimated)

 Tensor Operations:
  • SIMD Addition: 3.5 GB/s

💾 Memory Performance:
  • Copy Bandwidth: 20.9 GB/s
  • Random Access Latency: 1.8 ns

🎯 Performance Assessment:
  ✅ Acceptable: BLAS delivering 1000+ GFLOPS
  • Est. efficiency: 44% (vs theoretical peak)

Note: Benchmarked on Apple M1 MacBook Pro under heavy load 
(should be significantly higher on a clean system).

Performance Results (Apple M1 MacBook Pro under heavy load):

  • Matrix 256×256: 0.0ms/iter, 937 GFLOPS
  • Matrix 512×512: 0.2ms/iter, 1084 GFLOPS
  • Matrix 1024×1024: 2.1ms/iter, 1164 GFLOPS (peak performance)
  • Matrix 2048×2048: 20.9ms/iter, 823 GFLOPS

Performance Achievement: From 6418ms (naive) to 2.2ms (BLAS) = ~2900x speedup on matrix operations

System Status:

  • BLAS Backend: Apple Accelerate integration delivering acceptable performance
  • Peak Performance: 1164 GFLOPS measured (44% of theoretical maximum, impressive under load)
  • Memory Bandwidth: 20.9 GB/s copying, well-optimized operations
  • Hardware Detection: M-series Apple Silicon detection functional

Known Issues

  • Model Loading: Currently creates dummy models - real weight loading not implemented
  • Tokenizer: Placeholder implementation - needs proper BPE tokenizer
  • WebSocket: Basic structure only - streaming not implemented
  • Metal/CUDA: Backend stubs only - GPU kernels not implemented

License

This experimental implementation follows the same license as the original DeepSeek V3 project.

Is This Ready for Production?

No - this is a research/development foundation. The model itself is not implemented yet, but the foundation compiles and runs:

  • What works now: Compiles and runs with Zig 0.15.0-dev; HTTP server, BLAS-accelerated tensor operations, SIMD math, and benchmarks all execute successfully
  • What's missing: The actual DeepSeek V3 model implementation (architecture, weight loading, tokenizer - see Known Issues)
  • Timeline: Foundation is in place; model implementation is the next major milestone

Comparison to Other Projects

Project | Language | Status | Focus
This project | Zig | Foundation + API | Web-first inference
llama.cpp | C++ | Production | CLI/library
Candle | Rust | Production | ML framework
ZML | Zig | Research | Low-level ML ops

Unique advantages: Built-in web server, Zig's zero-cost abstractions, single binary deployment.


Built with Zig for blazing fast LLM inference!

Performance Notes

Current Status: BLAS integration working - the Apple Accelerate backend is now functional in the draft implementation. Detailed numbers are listed under Performance Results above; the headline figures are a measured peak of 1164 GFLOPS (~44% of the estimated theoretical maximum), 20.9 GB/s memory copy bandwidth, and a roughly 3000x speedup over the naive matrix-multiplication path (6418ms naive to ~2.1ms with BLAS).

Next Steps: Focus on transformer architecture, attention mechanisms, and model-specific optimizations for the draft DeepSeek V3 implementation.