<div align="center">
  <img src="./dzv3-logo.svg" alt="DeepSeek V3 in Zig" width="100%" />
</div>

<hr>

<div align="center" style="line-height: 1.5;">
  <a href="https://ziglang.org/"><img src="https://img.shields.io/badge/Language-Zig-F7A41D?style=for-the-badge&logo=zig&logoColor=white" alt="Language: Zig"></a>
  <a href="LICENSE-CODE"><img src="https://img.shields.io/badge/License-DSV3-blue.svg?style=for-the-badge" alt="License: DeepSeek"></a>
  <a href="#status"><img src="https://img.shields.io/badge/Status-Proposal-orange?style=for-the-badge" alt="Status: Proposal"></a>
  <br>
  <a href="#why-propose-deepseek-v3-in-zig"><img src="https://img.shields.io/badge/Performance-High_Efficiency-44CC11?style=for-the-badge" alt="Performance: High Efficiency"></a>
  <a href="#platform-specific-optimizations"><img src="https://img.shields.io/badge/Platform-Cross_Platform-5A6AB1?style=for-the-badge" alt="Platform: Cross Platform"></a>
  <br>
  <a href="#core-system"><img src="https://img.shields.io/badge/Feature-SIMD_Optimized-1DA1F2?style=for-the-badge" alt="Feature: SIMD Optimized"></a>
  <a href="#model-architecture"><img src="https://img.shields.io/badge/Architecture-MoE-F94877?style=for-the-badge" alt="Architecture: MoE"></a>
  <a href="#computation-backend"><img src="https://img.shields.io/badge/Backend-Customizable-6236FF?style=for-the-badge" alt="Backend: Customizable"></a>
</div>

<hr />

<h1 align="center">DeepSeek V3 in Zig: <code>DeepZig V3</code> Proposal</h1>

## Overview

This document outlines the architecture for implementing DeepSeek V3 in the Zig programming language. The focus is on leveraging Zig's unique features to create a high-performance, memory-efficient, and robust implementation of the DeepSeek V3 architecture. The proposal targets five goals:

1. **Superior Performance**: Leverage Zig's compile-time metaprogramming, SIMD vectorization, and low-level control to achieve optimal performance across platforms
2. **Memory Efficiency**: Utilize Zig's explicit allocator system and arena allocation patterns for precise resource management
3. **Concurrent Processing**: Implement efficient parallel execution using thread pools and Zig's lightweight synchronization primitives
4. **Type Safety & Reliability**: Employ Zig's strong type system, comptime checks, and explicit error handling to prevent runtime errors
5. **Cross-Platform Support**: Create a portable implementation with seamless support across architectures (x86_64, ARM64, etc.)

## Table of Contents

1. [Overview](#overview)
2. [System Architecture](#system-architecture)
3. [Component Design](#component-design)
   1. [Core System](#core-system)
      1. [Memory Management System](#memory-management-system)
      2. [Tensor Implementation](#tensor-implementation)
      3. [Error Handling Framework](#error-handling-framework)
      4. [Concurrency Model](#concurrency-model)
   2. [Model Architecture](#model-architecture)
      1. [Transformer Core](#transformer-core)
      2. [Attention Mechanisms](#attention-mechanisms)
      3. [Mixture of Experts (MoE)](#mixture-of-experts-moe)
   3. [Computation Backend](#computation-backend)
      1. [Backend Interface](#backend-interface)
      2. [Metal Integration for Apple Silicon](#metal-integration-for-apple-silicon)
   4. [Inference Pipeline](#inference-pipeline)
      1. [Model Loading](#model-loading)
      2. [Generation Strategies](#generation-strategies)
   5. [Optimization Layer](#optimization-layer)
      1. [Compile-Time Optimizations](#compile-time-optimizations)
      2. [Quantization Framework](#quantization-framework)
4. [Platform-Specific Optimizations](#platform-specific-optimizations)
5. [Development Roadmap](#development-roadmap)
6. [Why Propose DeepSeek V3 in Zig?](#why-propose-deepseek-v3-in-zig)

## System Architecture

### High-Level Component Overview

The DeepSeek V3 Zig implementation consists of the following major components:

```
DeepSeek V3 Zig
│
├── Core
│   ├── Memory Management System
│   │   ├── Custom Allocator Framework
│   │   ├── Arena Allocation Strategy
│   │   └── Memory Pool Implementation
│   ├── Tensor Implementation
│   │   ├── SIMD-Optimized Operations
│   │   ├── Compile-Time Specialization
│   │   └── Zero-Cost Abstractions
│   └── Error Handling Framework
│       ├── Comprehensive Error Types
│       └── Performance-Optimized Error Paths
│
├── Model Architecture
│   ├── Transformer Layers
│   │   ├── Comptime-Generated Layer Variants
│   │   └── Optimized Forward Pass
│   ├── Attention Mechanisms
│   │   ├── Vectorized Multi-Head Attention
│   │   └── Efficient KV-Cache Management
│   ├── MoE (Mixture of Experts)
│   │   ├── Parallel Expert Execution
│   │   └── Optimized Router Implementation
│   └── Embedding Systems
│       ├── Memory-Efficient Token Embeddings
│       └── Positional Encoding Optimizations
│
├── Computation Backend
│   ├── CPU Implementation
│   │   ├── SIMD Vectorization
│   │   └── Multi-Threaded Execution
│   ├── GPU Integration (Optional)
│   │   ├── CUDA Support (NVIDIA)
│   │   ├── Metal Support (Apple)
│   │   └── ROCm Support (AMD)
│   └── Backend Interface Layer
│       ├── Zero-Cost Abstraction
│       └── Compile-Time Dispatch
│
├── Inference Pipeline
│   ├── Model Loading & Weight Management
│   ├── Tokenization System
│   ├── Advanced Generation Strategies
│   │   ├── Speculative Decoding
│   │   └── Beam Search
│   └── Streaming Output Processing
│
└── Optimization Layer
    ├── Compile-Time Specialization
    │   ├── Architecture-Specific Code Gen
    │   └── Tensor Operation Optimization
    ├── Runtime Performance Tuning
    │   ├── Cache-Aware Memory Layout
    │   └── Workload Balancing
    └── Quantization Framework
        ├── Mixed-Precision Support
        └── Hardware-Accelerated Execution
```

## Component Design

### 1. Core System

#### 1.1 Memory Management System

Unlike Python's garbage collection, Zig provides explicit allocator interfaces that give fine-grained control over memory allocation and deallocation strategies:

```zig
const std = @import("std");

// A custom tensor allocator that combines multiple strategies
pub const TensorAllocator = struct {
    // Arena for temporary tensor operations during inference
    arena: std.heap.ArenaAllocator,
    // Fixed buffer for small activations
    fixed_buffer: [1024 * 1024]u8 = undefined,
    fixed_allocator: std.heap.FixedBufferAllocator,
    // General-purpose allocator for long-lived objects
    gpa: std.heap.GeneralPurposeAllocator(.{}),

    pub fn init(backing_allocator: std.mem.Allocator) !*TensorAllocator {
        const self = try backing_allocator.create(TensorAllocator);
        self.* = .{
            .arena = std.heap.ArenaAllocator.init(backing_allocator),
            .fixed_allocator = std.heap.FixedBufferAllocator.init(&self.fixed_buffer),
            .gpa = std.heap.GeneralPurposeAllocator(.{}){},
        };
        return self;
    }

    pub fn deinit(self: *TensorAllocator) void {
        self.arena.deinit();
        _ = self.gpa.deinit();
        // The caller still owns `self` and frees it with the backing allocator
    }

    // Pick the right allocator for a specific tensor use case
    pub fn temporaryAllocator(self: *TensorAllocator) std.mem.Allocator {
        return self.arena.allocator();
    }

    pub fn smallActivationAllocator(self: *TensorAllocator) std.mem.Allocator {
        return self.fixed_allocator.allocator();
    }

    pub fn persistentAllocator(self: *TensorAllocator) std.mem.Allocator {
        return self.gpa.allocator();
    }
};

// Inference function example with specialized memory allocation
pub fn performInference(model: *Model, input: Tensor) !Tensor {
    const allocator = try TensorAllocator.init(std.heap.page_allocator);
    defer allocator.deinit();

    // Use different allocators for different tensor operations
    const activations = try computeActivations(model, input, allocator.temporaryAllocator());
    const weights = try loadModelWeights(model, allocator.persistentAllocator());

    // Intermediate results are freed in bulk when the arena is deinitialized
    return try generateOutput(activations, weights, allocator.temporaryAllocator());
}
```

**Key Features:**

- **Tiered Allocation Strategy**: Different allocators for different memory usage patterns
- **Arena Allocation**: Bulk allocation and freeing for intermediate tensors, dramatically reducing memory-management overhead
- **Fixed Buffer Allocation**: Zero-heap-allocation path for small, predictable tensor operations
- **Memory Pool Implementation**: Custom pools for tensor data to minimize fragmentation (see the sketch below)
- **Explicit Error Handling**: All allocation failures are explicitly handled with Zig's error system
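
The memory-pool strategy in the list above is not shown in the snippet; a minimal sketch of one possible pool, built on `std.heap.MemoryPool` from the Zig standard library and using a hypothetical fixed-size `TensorBlock`, could look like this:

```zig
const std = @import("std");

// Hypothetical fixed-size block reused for small tensor buffers; the 256 KiB
// size and the `TensorBlock` name are illustrative assumptions only.
const TensorBlock = struct {
    data: [256 * 1024]u8 align(64), // cache-line-aligned backing storage
};

pub const TensorPool = struct {
    pool: std.heap.MemoryPool(TensorBlock),

    pub fn init(backing: std.mem.Allocator) TensorPool {
        return .{ .pool = std.heap.MemoryPool(TensorBlock).init(backing) };
    }

    pub fn deinit(self: *TensorPool) void {
        self.pool.deinit();
    }

    // Blocks are recycled by the pool, so repeated acquire/release cycles
    // do not fragment the backing allocator.
    pub fn acquire(self: *TensorPool) !*TensorBlock {
        return self.pool.create();
    }

    pub fn release(self: *TensorPool, block: *TensorBlock) void {
        self.pool.destroy(block);
    }
};
```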

#### 1.2 Tensor Implementation

Tensors are the fundamental data structure for DeepSeek. Our implementation leverages Zig's advanced compile-time features, SIMD capabilities, and memory-layout optimizations for maximum performance:
```zig
|
|
pub fn Tensor(comptime DataType: type, comptime dimensions: usize) type {
|
|
return struct {
|
|
const Self = @This();
|
|
|
|
data: []DataType,
|
|
shape: [dimensions]usize,
|
|
strides: [dimensions]usize,
|
|
allocator: std.mem.Allocator,
|
|
is_contiguous: bool,
|
|
|
|
// Vector types for SIMD operations based on hardware capabilities
|
|
        // SIMD vector type selected from the target CPU at compile time
        // (e.g. 8 lanes of f32 with AVX2, 4 with SSE/NEON).
        pub const VecType = @Vector(std.simd.suggestVectorLength(DataType) orelse 4, DataType);
|
|
|
|
// Number of elements in the SIMD vector
|
|
pub const vec_width = @sizeOf(VecType) / @sizeOf(DataType);
|
|
|
|
pub fn init(allocator: std.mem.Allocator, shape: [dimensions]usize) !Self {
|
|
var strides: [dimensions]usize = undefined;
|
|
var total_size: usize = 1;
|
|
|
|
// Calculate C-contiguous (row-major) strides for optimal memory access
|
|
var i: usize = dimensions;
|
|
while (i > 0) {
|
|
i -= 1;
|
|
strides[i] = total_size;
|
|
total_size *= shape[i];
|
|
}
|
|
|
|
// Align memory for optimal SIMD access
|
|
const alignment = @alignOf(VecType);
|
|
const data = try allocator.alignedAlloc(DataType, alignment, total_size);
|
|
|
|
return Self{
|
|
.data = data,
|
|
.shape = shape,
|
|
.strides = strides,
|
|
.allocator = allocator,
|
|
.is_contiguous = true,
|
|
};
|
|
}
|
|
|
|
pub fn deinit(self: *Self) void {
|
|
self.allocator.free(self.data);
|
|
}
|
|
|
|
// Optimized SIMD matrix multiplication for 2D tensors
|
|
pub fn matmul(self: *Self, other: *Self, allocator: std.mem.Allocator) !Self {
|
|
            std.debug.assert(dimensions == 2); // `other` shares the same 2-D tensor type
            std.debug.assert(self.shape[1] == other.shape[0]);
|
|
|
|
const M = self.shape[0];
|
|
const K = self.shape[1];
|
|
const N = other.shape[1];
|
|
|
|
var result = try Self.init(allocator, .{ M, N });
|
|
|
|
// Zero initialization
|
|
@memset(result.data, 0);
|
|
|
|
// Check if both tensors are contiguous for optimal performance
|
|
if (self.is_contiguous and other.is_contiguous) {
|
|
// Cache-aware blocked matrix multiplication with SIMD
|
|
const block_size = 64; // Tuned for L1 cache
|
|
|
|
// For each block
|
|
var i: usize = 0;
|
|
while (i < M) : (i += block_size) {
|
|
const i_end = @min(i + block_size, M);
|
|
var j: usize = 0;
|
|
while (j < N) : (j += block_size) {
|
|
const j_end = @min(j + block_size, N);
|
|
var k: usize = 0;
|
|
while (k < K) : (k += block_size) {
|
|
const k_end = @min(k + block_size, K);
|
|
|
|
// Process each block
|
|
var ii: usize = i;
|
|
while (ii < i_end) : (ii += 1) {
|
|
var jj: usize = j;
|
|
while (jj < j_end) : (jj += vec_width) {
|
|
// SIMD-optimized inner loop
|
|
if (jj + vec_width <= j_end) {
|
|
var sum: VecType = @splat(0);
|
|
var kk: usize = k;
|
|
while (kk < k_end) : (kk += 1) {
|
|
const a_val = self.data[ii * K + kk];
|
|
const b_vec: VecType = blk: {
|
|
var tmp: [vec_width]DataType = undefined;
|
|
for (0..vec_width) |v| {
|
|
if (jj + v < j_end) {
|
|
tmp[v] = other.data[kk * N + (jj + v)];
|
|
} else {
|
|
tmp[v] = 0;
|
|
}
|
|
}
|
|
break :blk tmp;
|
|
};
|
|
sum += @splat(a_val) * b_vec;
|
|
}
|
|
|
|
// Store result
|
|
for (0..vec_width) |v| {
|
|
if (jj + v < j_end) {
|
|
result.data[ii * N + (jj + v)] += sum[v];
|
|
}
|
|
}
|
|
} else {
|
|
// Handle remaining columns (tail)
|
|
while (jj < j_end) : (jj += 1) {
|
|
var sum: DataType = 0;
|
|
var kk: usize = k;
|
|
while (kk < k_end) : (kk += 1) {
|
|
sum += self.data[ii * K + kk] * other.data[kk * N + jj];
|
|
}
|
|
result.data[ii * N + jj] += sum;
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
} else {
|
|
// Fallback for non-contiguous tensors
|
|
var i: usize = 0;
|
|
while (i < M) : (i += 1) {
|
|
var j: usize = 0;
|
|
while (j < N) : (j += 1) {
|
|
var sum: DataType = 0;
|
|
var k: usize = 0;
|
|
while (k < K) : (k += 1) {
|
|
sum += self.at(.{i, k}) * other.at(.{k, j});
|
|
}
|
|
try result.set(.{i, j}, sum);
|
|
}
|
|
}
|
|
}
|
|
|
|
return result;
|
|
}
|
|
|
|
// Access element at specific indices
|
|
pub fn at(self: Self, indices: [dimensions]usize) DataType {
|
|
var offset: usize = 0;
|
|
inline for (0..dimensions) |i| {
|
|
offset += indices[i] * self.strides[i];
|
|
}
|
|
return self.data[offset];
|
|
}
|
|
|
|
// Set element at specific indices
|
|
pub fn set(self: *Self, indices: [dimensions]usize, value: DataType) !void {
|
|
var offset: usize = 0;
|
|
inline for (0..dimensions) |i| {
|
|
offset += indices[i] * self.strides[i];
|
|
}
|
|
self.data[offset] = value;
|
|
}
|
|
|
|
// Apply element-wise operations with SIMD acceleration
|
|
pub fn map(self: Self, comptime op: fn (DataType) DataType, allocator: std.mem.Allocator) !Self {
|
|
var result = try Self.init(allocator, self.shape);
|
|
|
|
// Use SIMD operations for contiguous data
|
|
if (self.is_contiguous) {
|
|
var i: usize = 0;
|
|
const vec_chunks = self.data.len / vec_width;
|
|
|
|
// Process in SIMD chunks
|
|
while (i < vec_chunks) : (i += 1) {
|
|
const base_idx = i * vec_width;
|
|
var vec: VecType = undefined;
|
|
|
|
// Load vector
|
|
for (0..vec_width) |j| {
|
|
vec[j] = self.data[base_idx + j];
|
|
}
|
|
|
|
// Apply operation on each vector element
|
|
for (0..vec_width) |j| {
|
|
vec[j] = op(vec[j]);
|
|
}
|
|
|
|
// Store result
|
|
for (0..vec_width) |j| {
|
|
result.data[base_idx + j] = vec[j];
|
|
}
|
|
}
|
|
|
|
// Process remaining elements
|
|
const remaining_start = vec_chunks * vec_width;
|
|
for (remaining_start..self.data.len) |j| {
|
|
result.data[j] = op(self.data[j]);
|
|
}
|
|
} else {
|
|
// Fallback for non-contiguous data
|
|
var indices: [dimensions]usize = .{0} ** dimensions;
|
|
var done = false;
|
|
|
|
while (!done) {
|
|
const val = self.at(indices);
|
|
try result.set(indices, op(val));
|
|
|
|
// Increment indices
|
|
var d = dimensions - 1;
|
|
while (true) {
|
|
indices[d] += 1;
|
|
if (indices[d] < self.shape[d]) break;
|
|
indices[d] = 0;
|
|
if (d == 0) {
|
|
done = true;
|
|
break;
|
|
}
|
|
d -= 1;
|
|
}
|
|
}
|
|
}
|
|
|
|
return result;
|
|
}
|
|
};
|
|
}
|
|
|
|
// Specialized tensor types for common uses
|
|
const FloatTensor1D = Tensor(f32, 1);
|
|
const FloatTensor2D = Tensor(f32, 2);
|
|
const FloatTensor4D = Tensor(f32, 4); // Common for batch x height x width x channels
|
|
const QuantizedTensor4D = Tensor(i8, 4); // For quantized operations
|
|
```

**Key Features:**

- **Hardware-Aware SIMD Vectorization**: The vector width is selected for the target CPU at compile time
- **Cache-Optimized Algorithms**: Blocked matrix multiplication designed for L1/L2 cache efficiency
- **Aligned Memory Allocation**: Ensures data is properly aligned for SIMD operations
- **Specialized Tensor Types**: Pre-defined tensor configurations for common use cases (a usage sketch follows this list)
- **Automatic Fallbacks**: Graceful degradation for non-contiguous tensors or unsupported operations
- **Compile-Time Optimization**: Tensor dimensions and data types resolved at compile time for maximum performance
- **Zero-Runtime Overhead**: SIMD operations with no dynamic dispatch or virtual function calls
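
A usage sketch for the specialized tensor types (hedged: it assumes the `Tensor` factory above compiles as written):

```zig
const std = @import("std");

// Multiply two 2-D f32 tensors and read back one element.
pub fn tensorExample(allocator: std.mem.Allocator) !void {
    var a = try FloatTensor2D.init(allocator, .{ 128, 256 });
    defer a.deinit();
    var b = try FloatTensor2D.init(allocator, .{ 256, 64 });
    defer b.deinit();

    @memset(a.data, 1.0);
    @memset(b.data, 0.5);

    // The blocked, SIMD-friendly path is taken because both tensors are contiguous.
    var c = try a.matmul(&b, allocator);
    defer c.deinit();

    std.debug.print("c[0][0] = {d}\n", .{c.at(.{ 0, 0 })});
}
```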

#### 1.3 Error Handling Framework

Zig's error handling system provides a powerful foundation for robust, high-performance software. Unlike exceptions in languages like C++ or Python, Zig's error handling is explicit and deterministic, making it particularly well suited to large-scale machine-learning applications:
```zig
|
|
// Define a comprehensive set of potential errors with clear semantic meaning
|
|
const ModelError = error{
|
|
ModelLoadFailed,
|
|
InvalidDimension,
|
|
InvalidShape,
|
|
OutOfMemory,
|
|
ComputeBackendError,
|
|
InvalidWeight,
|
|
UnsupportedOperation,
|
|
UnsupportedDataType,
|
|
DeviceNotAvailable,
|
|
TensorShapeMismatch,
|
|
QuantizationError,
|
|
InvalidConfiguration,
|
|
};
|
|
|
|
// Union error sets for comprehensive error handling
|
|
const DeepSeekError = ModelError || TensorError || AllocationError || IoError;
|
|
|
|
// Example function demonstrating Zig's error handling with defer for cleanup
|
|
fn loadModel(allocator: std.mem.Allocator, path: []const u8) DeepSeekError!*Model {
|
|
var file = try std.fs.cwd().openFile(path, .{});
|
|
defer file.close(); // Ensures file is closed even if an error occurs
|
|
|
|
var buffer = std.ArrayList(u8).init(allocator);
|
|
defer buffer.deinit(); // Clean up buffer regardless of success/failure
|
|
|
|
    const file_size = file.getEndPos() catch return ModelError.ModelLoadFailed;
    try buffer.resize(@intCast(file_size)); // readAll fills `items`, so it must be sized first

    const bytes_read = try file.readAll(buffer.items);
    if (bytes_read == 0) return ModelError.ModelLoadFailed;
|
|
|
|
var model = try allocator.create(Model);
|
|
errdefer allocator.destroy(model); // Only called if an error occurs after this point
|
|
|
|
model.* = Model.init(allocator);
|
|
errdefer model.deinit(); // Only called if an error occurs after this point
|
|
|
|
// Parse weights and initialize model...
|
|
if (!try parseWeights(model, buffer.items)) {
|
|
return ModelError.InvalidWeight;
|
|
}
|
|
|
|
return model;
|
|
}
|
|
|
|
// Demonstrate error handling in caller code
|
|
pub fn main() !void {
|
|
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
|
|
defer _ = gpa.deinit();
|
|
const allocator = gpa.allocator();
|
|
|
|
// Handle errors explicitly with try/catch blocks
|
|
const model = loadModel(allocator, "model.bin") catch |err| {
|
|
switch (err) {
|
|
ModelError.ModelLoadFailed => {
|
|
std.debug.print("Failed to load model file\n", .{});
|
|
return err;
|
|
},
|
|
ModelError.InvalidWeight => {
|
|
std.debug.print("Model contains invalid weights\n", .{});
|
|
return err;
|
|
},
|
|
else => {
|
|
std.debug.print("Unexpected error: {}\n", .{err});
|
|
return err;
|
|
},
|
|
}
|
|
};
|
|
defer model.deinit();
|
|
|
|
// Continue with model usage...
|
|
}
|
|
```

**Key Features:**

- **Explicit Error Types**: Clearly defined error sets that precisely describe what can go wrong
- **No Exceptions**: Deterministic error handling with no hidden control flow
- **Resource Safety**: Automatic cleanup with `defer` and `errdefer` ensures resources are properly managed
- **Performance Optimization**: Error handling doesn't rely on stack unwinding or dynamic dispatch
- **Composable Error Sets**: Error types can be combined with the `||` operator (see the sketch after this list)
- **Selective Handling**: `catch` with a `switch` handles specific errors where needed
- **Error Tracing**: Built-in error return traces for debugging
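
The composed sets referenced above (`TensorError`, `AllocationError`, `IoError`) are not defined in this document; a minimal, self-contained sketch of set composition and error-return tracing (the member names are assumptions) looks like this:

```zig
const std = @import("std");

// Illustrative member errors; the real sets would mirror the failure
// modes of the tensor, allocator, and I/O layers.
const TensorError = error{ ShapeMismatch, NonContiguous };
const AllocationError = error{OutOfMemory};
const IoError = error{ FileNotFound, EndOfStream };

// `||` merges error sets, just as in the DeepSeekError definition above.
const DeepSeekError = TensorError || AllocationError || IoError;

fn failingStep() DeepSeekError!void {
    return error.ShapeMismatch;
}

pub fn main() void {
    failingStep() catch |err| {
        std.debug.print("failed: {s}\n", .{@errorName(err)});
        // In Debug and ReleaseSafe builds this prints the chain of `return err` sites.
        if (@errorReturnTrace()) |trace| std.debug.dumpStackTrace(trace.*);
    };
}
```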

#### 1.4 Concurrency Model

Computation-intensive operations are parallelized with Zig's threading primitives: a `std.Thread.Pool` distributes CPU-bound work across cores, with lightweight atomic counters for synchronization and no runtime or garbage-collector overhead:
```zig
|
|
const std = @import("std");
|
|
|
|
// Thread pool for CPU-bound parallel tasks
|
|
pub const ComputeThreadPool = struct {
|
|
pool: std.Thread.Pool,
|
|
completion_count: std.atomic.Atomic(usize),
|
|
|
|
pub fn init(thread_count: usize) !ComputeThreadPool {
|
|
var pool: std.Thread.Pool = undefined;
|
|
try pool.init(.{
|
|
.allocator = std.heap.c_allocator,
|
|
.n_jobs = thread_count,
|
|
});
|
|
|
|
return ComputeThreadPool{
|
|
.pool = pool,
|
|
.completion_count = std.atomic.Atomic(usize).init(0),
|
|
};
|
|
}
|
|
|
|
pub fn deinit(self: *ComputeThreadPool) void {
|
|
self.pool.deinit();
|
|
}
|
|
|
|
// Execute a compute task asynchronously
|
|
pub fn compute(self: *ComputeThreadPool, task: *const fn(*anyopaque) void, context: *anyopaque) !void {
|
|
try self.pool.spawn(task, context);
|
|
}
|
|
|
|
// Wait for all compute tasks to complete
|
|
pub fn waitAll(self: *ComputeThreadPool) void {
|
|
// Process tasks in the event loop until all are complete
|
|
while (self.completion_count.load(.Acquire) > 0) {
|
|
            std.time.sleep(std.time.ns_per_ms);
|
|
}
|
|
}
|
|
};
|
|
|
|
// Parallel tensor operation example with async/await
|
|
pub fn parallelMatMul(allocator: std.mem.Allocator, a: *Tensor(f32, 2), b: *Tensor(f32, 2)) !*Tensor(f32, 2) {
|
|
const M = a.shape[0];
|
|
const K = a.shape[1];
|
|
const N = b.shape[1];
|
|
|
|
var result = try Tensor(f32, 2).init(allocator, .{M, N});
|
|
errdefer result.deinit();
|
|
|
|
@memset(result.data, 0);
|
|
|
|
// Create thread pool with optimal number of threads
|
|
const cpu_count = try std.Thread.getCpuCount();
|
|
var thread_pool = try ComputeThreadPool.init(cpu_count);
|
|
defer thread_pool.deinit();
|
|
|
|
// Split work based on number of available cores
|
|
const rows_per_thread = (M + cpu_count - 1) / cpu_count;
|
|
|
|
// Define the worker task
|
|
const WorkContext = struct {
|
|
a: *const Tensor(f32, 2),
|
|
b: *const Tensor(f32, 2),
|
|
result: *Tensor(f32, 2),
|
|
start_row: usize,
|
|
end_row: usize,
|
|
thread_pool: *ComputeThreadPool,
|
|
};
|
|
|
|
// Worker function for computing a subset of rows
|
|
const workerFn = struct {
|
|
fn compute(context_ptr: *anyopaque) void {
|
|
            const context: *WorkContext = @ptrCast(@alignCast(context_ptr));
|
|
const a = context.a;
|
|
const b = context.b;
|
|
const result = context.result;
|
|
const start_row = context.start_row;
|
|
const end_row = context.end_row;
|
|
|
|
// Compute assigned rows
|
|
for (start_row..end_row) |i| {
|
|
if (i >= a.shape[0]) break;
|
|
|
|
for (0..b.shape[1]) |j| {
|
|
var sum: f32 = 0.0;
|
|
for (0..a.shape[1]) |k| {
|
|
sum += a.at(.{i, k}) * b.at(.{k, j});
|
|
}
|
|
result.set(.{i, j}, sum) catch {};
|
|
}
|
|
}
|
|
|
|
// Mark task as complete
|
|
_ = context.thread_pool.completion_count.fetchSub(1, .Release);
|
|
}
|
|
};
|
|
|
|
// Spawn workers for each section of the matrix
|
|
for (0..cpu_count) |i| {
|
|
const start_row = i * rows_per_thread;
|
|
        const end_row = @min(start_row + rows_per_thread, M);
|
|
|
|
if (start_row >= M) break;
|
|
|
|
// Create context for this worker
|
|
var context = try allocator.create(WorkContext);
|
|
context.* = .{
|
|
.a = a,
|
|
.b = b,
|
|
.result = result,
|
|
.start_row = start_row,
|
|
.end_row = end_row,
|
|
.thread_pool = &thread_pool,
|
|
};
|
|
|
|
// Increment completion counter before spawning task
|
|
_ = thread_pool.completion_count.fetchAdd(1, .Release);
|
|
|
|
// Spawn the worker task
|
|
try thread_pool.compute(workerFn.compute, context);
|
|
}
|
|
|
|
// Wait for all tasks to complete
|
|
thread_pool.waitAll();
|
|
|
|
return result;
|
|
}
|
|
```

**Key Features:**

- **Thread Pool Management**: Efficient worker-thread allocation based on available CPU cores
- **Work Partitioning**: Automatic division of work across available cores
- **Minimal Synchronization**: Lock-free atomic counters where coordination is needed (an alternative using `std.Thread.WaitGroup` is sketched after this list)
- **Resource Safety**: Proper cleanup with `defer` and `errdefer` even during concurrent execution
- **Structured Concurrency**: Clear task dependencies and lifecycle management
- **Zero Runtime Overhead**: No garbage collection or runtime dependencies
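
One possible refinement, offered here as an assumption rather than part of the design above, is to replace the sleep-polling completion counter with `std.Thread.WaitGroup`:

```zig
const std = @import("std");

// Sketch: std.Thread.Pool plus std.Thread.WaitGroup instead of an atomic
// counter polled in a sleep loop.
fn worker(task_id: usize, wg: *std.Thread.WaitGroup) void {
    defer wg.finish();
    std.debug.print("task {d} done\n", .{task_id});
}

pub fn runTasks(allocator: std.mem.Allocator, n_tasks: usize) !void {
    var pool: std.Thread.Pool = undefined;
    try pool.init(.{ .allocator = allocator });
    defer pool.deinit();

    var wg: std.Thread.WaitGroup = .{};
    for (0..n_tasks) |i| {
        wg.start();
        try pool.spawn(worker, .{ i, &wg });
    }
    wg.wait(); // blocks without busy-waiting
}
```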

### 2. Model Architecture

#### 2.1 Transformer Core

The transformer architecture is the foundation of DeepSeek V3. Our Zig implementation leverages compile-time metaprogramming and advanced memory optimizations for maximum performance:
```zig
|
|
const std = @import("std");
|
|
|
|
// Precomputed type variants for different data precisions
|
|
pub const DataType = enum {
|
|
f32, // 32-bit floating point (for debugging/development)
|
|
bf16, // BFloat16 (for training/default inference)
|
|
f16, // Float16 (for hardware with native f16 support)
|
|
i8, // 8-bit integer (for quantized inference)
|
|
i4, // 4-bit integer (for extreme quantization)
|
|
};
|
|
|
|
// Configuration struct with default values matching DeepSeek V3
|
|
pub const ModelArgs = struct {
|
|
// Core model parameters
|
|
max_batch_size: usize = 8,
|
|
max_seq_len: usize = 4096 * 4,
|
|
data_type: DataType = .bf16,
|
|
vocab_size: usize = 102400,
|
|
dim: usize = 2048,
|
|
inter_dim: usize = 10944,
|
|
moe_inter_dim: usize = 1408,
|
|
n_layers: usize = 27,
|
|
n_dense_layers: usize = 1,
|
|
n_heads: usize = 16,
|
|
|
|
// MoE configuration
|
|
n_routed_experts: usize = 64,
|
|
n_shared_experts: usize = 2,
|
|
n_activated_experts: usize = 6,
|
|
n_expert_groups: usize = 1,
|
|
n_limited_groups: usize = 1,
|
|
score_func: enum { softmax, sigmoid } = .softmax,
|
|
route_scale: f32 = 1.0,
|
|
|
|
// MLA configuration
|
|
q_lora_rank: usize = 0,
|
|
kv_lora_rank: usize = 512,
|
|
qk_nope_head_dim: usize = 128,
|
|
qk_rope_head_dim: usize = 64,
|
|
v_head_dim: usize = 128,
|
|
|
|
// Positional encoding
|
|
original_seq_len: usize = 4096,
|
|
rope_theta: f32 = 10000.0,
|
|
rope_factor: f32 = 40,
|
|
beta_fast: usize = 32,
|
|
beta_slow: usize = 1,
|
|
mscale: f32 = 1.0,
|
|
|
|
// Runtime options
|
|
use_flash_attention: bool = true, // Use optimized attention implementation
|
|
use_parallel_experts: bool = true, // Run experts in parallel
|
|
max_token_limit: ?usize = null, // Optional token generation limit
|
|
|
|
// Generate optimized implementations based on config parameters
|
|
pub fn getModelType(self: @This()) type {
|
|
return struct {
|
|
const ModelType = @This();
|
|
const config = self;
|
|
|
|
// Select optimal types based on data_type
|
|
pub const StorageType = switch (config.data_type) {
|
|
.f32 => f32,
|
|
                .bf16 => u16, // raw BF16 bits; Zig has no built-in bfloat16 type
|
|
.f16 => f16,
|
|
.i8 => i8,
|
|
.i4 => i4,
|
|
};
|
|
|
|
// Define tensor types for different dimensions
|
|
pub const WeightTensor = Tensor(StorageType, 2);
|
|
pub const ActivationTensor = Tensor(f32, 3); // Always use f32 for activations
|
|
pub const EmbeddingTensor = Tensor(StorageType, 2);
|
|
pub const KVCacheTensor = Tensor(f32, 4); // [batch, seq_len, heads, dim]
|
|
|
|
// Generate layer configuration
|
|
pub const layer_config = struct {
|
|
pub const head_dim = (config.dim / config.n_heads);
|
|
pub const moe_layers_start = config.n_dense_layers;
|
|
};
|
|
};
|
|
}
|
|
};
|
|
|
|
// Main transformer model implementation
|
|
pub fn TransformerModel(comptime args: ModelArgs) type {
|
|
// Use comptime to generate a specialized model implementation based on args
|
|
return struct {
|
|
const Self = @This();
|
|
const ModelType = args.getModelType();
|
|
|
|
// Model components
|
|
allocator: std.mem.Allocator,
|
|
embedding: Embedding(args),
|
|
layers: []TransformerBlock(args),
|
|
norm: RMSNorm(args.dim),
|
|
head: Linear(args.dim, args.vocab_size),
|
|
freqs_cis: Tensor(f32, 3), // [max_seq_len, 2, qk_rope_head_dim]
|
|
|
|
// KV cache for optimized inference
|
|
kv_cache: ?ModelType.KVCacheTensor,
|
|
|
|
pub fn init(allocator: std.mem.Allocator) !Self {
|
|
// Initialize components
|
|
var embedding = try Embedding(args).init(allocator);
|
|
errdefer embedding.deinit();
|
|
|
|
var layers = try allocator.alloc(TransformerBlock(args), args.n_layers);
|
|
errdefer allocator.free(layers);
|
|
|
|
// Create layers with appropriate configurations
|
|
for (layers, 0..) |*layer, i| {
|
|
const is_moe = i >= args.n_dense_layers;
|
|
layer.* = try TransformerBlock(args).init(allocator, i, is_moe);
|
|
}
|
|
|
|
var norm = try RMSNorm(args.dim).init(allocator);
|
|
errdefer norm.deinit();
|
|
|
|
var head = try Linear(args.dim, args.vocab_size).init(allocator, false);
|
|
errdefer head.deinit();
|
|
|
|
// Precompute positional encoding frequencies
|
|
var freqs_cis = try precomputeFreqsCis(allocator, args);
|
|
|
|
return Self{
|
|
.allocator = allocator,
|
|
.embedding = embedding,
|
|
.layers = layers,
|
|
.norm = norm,
|
|
.head = head,
|
|
.freqs_cis = freqs_cis,
|
|
.kv_cache = null,
|
|
};
|
|
}
|
|
|
|
pub fn deinit(self: *Self) void {
|
|
self.embedding.deinit();
|
|
|
|
for (self.layers) |*layer| {
|
|
layer.deinit();
|
|
}
|
|
self.allocator.free(self.layers);
|
|
|
|
self.norm.deinit();
|
|
self.head.deinit();
|
|
self.freqs_cis.deinit();
|
|
|
|
if (self.kv_cache) |*cache| {
|
|
cache.deinit();
|
|
}
|
|
}
|
|
|
|
// Initialize KV cache for efficient inference
|
|
pub fn initKVCache(self: *Self) !void {
|
|
if (self.kv_cache != null) return;
|
|
|
|
const batch_size = args.max_batch_size;
|
|
const seq_len = args.max_seq_len;
|
|
const n_heads = args.n_heads;
|
|
const head_dim = ModelType.layer_config.head_dim;
|
|
|
|
self.kv_cache = try ModelType.KVCacheTensor.init(
|
|
self.allocator,
|
|
.{batch_size, seq_len, n_heads, head_dim * 2}
|
|
);
|
|
|
|
// Zero-initialize cache
|
|
@memset(self.kv_cache.?.data, 0);
|
|
}
|
|
|
|
// Forward pass through the transformer model
|
|
pub fn forward(self: *Self, token_ids: []const usize, start_pos: usize) !Tensor(f32, 2) {
|
|
const batch_size = 1; // Currently supporting batch_size=1 for inference
|
|
const seq_len = token_ids.len;
|
|
|
|
// Create tensor from token_ids
|
|
var input_tensor = try ModelType.ActivationTensor.init(
|
|
self.allocator,
|
|
.{batch_size, seq_len, args.dim}
|
|
);
|
|
defer input_tensor.deinit();
|
|
|
|
// Get embeddings for input tokens
|
|
try self.embedding.embed(token_ids, &input_tensor);
|
|
|
|
// Process through each transformer layer
|
|
var x = input_tensor;
|
|
const freqs_cis_slice = try self.freqs_cis.slice(.{start_pos, 0, 0}, .{start_pos + seq_len, 2, args.qk_rope_head_dim});
|
|
|
|
// Create attention mask for causal attention
|
|
var mask: ?Tensor(f32, 2) = null;
|
|
if (seq_len > 1) {
|
|
mask = try createCausalMask(self.allocator, seq_len);
|
|
defer if (mask) |*m| m.deinit();
|
|
}
|
|
|
|
// Process through transformer layers
|
|
for (self.layers) |*layer| {
|
|
x = try layer.forward(x, start_pos, freqs_cis_slice, mask);
|
|
}
|
|
|
|
// Apply final normalization
|
|
var normalized = try self.norm.forward(x);
|
|
defer normalized.deinit();
|
|
|
|
// Extract last token for prediction
|
|
var last_token = try normalized.slice(
|
|
.{0, seq_len - 1, 0},
|
|
.{batch_size, seq_len, args.dim}
|
|
);
|
|
defer last_token.deinit();
|
|
|
|
// Project to vocabulary
|
|
return try self.head.forward(last_token);
|
|
}
|
|
|
|
// Helper to create causal attention mask
|
|
fn createCausalMask(allocator: std.mem.Allocator, seq_len: usize) !Tensor(f32, 2) {
|
|
var mask = try Tensor(f32, 2).init(allocator, .{seq_len, seq_len});
|
|
errdefer mask.deinit();
|
|
|
|
for (0..seq_len) |i| {
|
|
for (0..seq_len) |j| {
|
|
const value: f32 = if (j <= i) 0.0 else -10000.0;
|
|
try mask.set(.{i, j}, value);
|
|
}
|
|
}
|
|
|
|
return mask;
|
|
}
|
|
};
|
|
}
|
|
|
|
// Generate specialized transformer based on configuration
|
|
pub fn createTransformer(allocator: std.mem.Allocator, args: ModelArgs) !*TransformerModel(args) {
|
|
var model = try allocator.create(TransformerModel(args));
|
|
errdefer allocator.destroy(model);
|
|
|
|
model.* = try TransformerModel(args).init(allocator);
|
|
return model;
|
|
}
|
|
```

This implementation leverages Zig's compile-time features to generate specialized model code from the provided configuration: generic types and comptime evaluation allow aggressive optimization while keeping the code flexible. A brief usage sketch follows.
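
The sketch below assumes the declarations above compile as written; the hyperparameter values repeat the `ModelArgs` defaults and the token ids are placeholders:

```zig
// The configuration is comptime-known, so TransformerModel(model_args) is a
// fully specialized type with no runtime dispatch.
const model_args = ModelArgs{ .dim = 2048, .n_layers = 27, .data_type = .bf16 };
const Model = TransformerModel(model_args);

pub fn runInference(allocator: std.mem.Allocator) !void {
    var model = try Model.init(allocator);
    defer model.deinit();

    try model.initKVCache();

    const prompt = [_]usize{ 1, 4081, 776 }; // placeholder token ids
    var logits = try model.forward(&prompt, 0);
    defer logits.deinit();
}
```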

#### 2.2 Attention Mechanisms

The Multi-Head Latent Attention (MLA) mechanism is a critical component of DeepSeek V3's performance. Our Zig implementation leverages compile-time specialization, SIMD vectorization, and cache-friendly algorithms for maximum efficiency:
```zig
|
|
// Generic MLA implementation with compile-time specialization
|
|
pub fn MLA(comptime args: ModelArgs) type {
|
|
return struct {
|
|
const Self = @This();
|
|
const ModelType = args.getModelType();
|
|
|
|
// Attention configuration
|
|
dim: usize,
|
|
n_heads: usize,
|
|
head_dim: usize,
|
|
q_lora_rank: usize,
|
|
kv_lora_rank: usize,
|
|
qk_nope_head_dim: usize,
|
|
qk_rope_head_dim: usize,
|
|
qk_head_dim: usize,
|
|
v_head_dim: usize,
|
|
softmax_scale: f32,
|
|
use_flash_attention: bool,
|
|
|
|
// Projection matrices
|
|
allocator: std.mem.Allocator,
|
|
wq: ?ColumnParallelLinear(args) = null, // Regular query projection
|
|
wq_a: ?Linear(args.dim, args.q_lora_rank) = null, // LoRA decomposition
|
|
q_norm: ?RMSNorm(args.q_lora_rank) = null, // LoRA normalization
|
|
wq_b: ?ColumnParallelLinear(args) = null, // LoRA decomposition
|
|
wkv_a: Linear(args.dim, args.kv_lora_rank + args.qk_rope_head_dim),
|
|
kv_norm: RMSNorm(args.kv_lora_rank),
|
|
wkv_b: ColumnParallelLinear(args),
|
|
wo: RowParallelLinear(args),
|
|
|
|
// KV caching - optimized for memory access patterns
|
|
kv_cache: ?Tensor(f32, 4) = null, // [batch, seq_len, heads, head_dim*2]
|
|
rope_cache: ?Tensor(f32, 3) = null, // [batch, seq_len, rope_dim]
|
|
|
|
// Initialize MLA with appropriate configuration
|
|
pub fn init(allocator: std.mem.Allocator) !Self {
|
|
const head_dim = args.dim / args.n_heads;
|
|
var softmax_scale = 1.0 / std.math.sqrt(@as(f32, @floatFromInt(args.qk_nope_head_dim + args.qk_rope_head_dim)));
|
|
|
|
// Apply scaling for extended context if needed
|
|
if (args.max_seq_len > args.original_seq_len) {
|
|
                const mscale = 0.1 * args.mscale * @log(args.rope_factor) + 1.0;
|
|
softmax_scale *= mscale * mscale;
|
|
}
|
|
|
|
// Initialize query projection (either direct or with LoRA)
|
|
var wq: ?ColumnParallelLinear(args) = null;
|
|
var wq_a: ?Linear(args.dim, args.q_lora_rank) = null;
|
|
var q_norm: ?RMSNorm(args.q_lora_rank) = null;
|
|
var wq_b: ?ColumnParallelLinear(args) = null;
|
|
|
|
if (args.q_lora_rank == 0) {
|
|
// Standard query projection
|
|
wq = try ColumnParallelLinear(args).init(
|
|
allocator,
|
|
args.dim,
|
|
args.n_heads * (args.qk_nope_head_dim + args.qk_rope_head_dim),
|
|
false
|
|
);
|
|
} else {
|
|
// Low-rank adaptation for query
|
|
wq_a = try Linear(args.dim, args.q_lora_rank).init(allocator, false);
|
|
q_norm = try RMSNorm(args.q_lora_rank).init(allocator);
|
|
wq_b = try ColumnParallelLinear(args).init(
|
|
allocator,
|
|
args.q_lora_rank,
|
|
args.n_heads * (args.qk_nope_head_dim + args.qk_rope_head_dim),
|
|
false
|
|
);
|
|
}
|
|
|
|
// Key-value projections
|
|
var wkv_a = try Linear(args.dim, args.kv_lora_rank + args.qk_rope_head_dim).init(allocator, false);
|
|
var kv_norm = try RMSNorm(args.kv_lora_rank).init(allocator);
|
|
var wkv_b = try ColumnParallelLinear(args).init(
|
|
allocator,
|
|
args.kv_lora_rank,
|
|
args.n_heads * (args.qk_nope_head_dim + args.v_head_dim),
|
|
false
|
|
);
|
|
|
|
// Output projection
|
|
var wo = try RowParallelLinear(args).init(
|
|
allocator,
|
|
args.n_heads * args.v_head_dim,
|
|
args.dim,
|
|
false
|
|
);
|
|
|
|
return Self{
|
|
.allocator = allocator,
|
|
.dim = args.dim,
|
|
.n_heads = args.n_heads,
|
|
.head_dim = head_dim,
|
|
.q_lora_rank = args.q_lora_rank,
|
|
.kv_lora_rank = args.kv_lora_rank,
|
|
.qk_nope_head_dim = args.qk_nope_head_dim,
|
|
.qk_rope_head_dim = args.qk_rope_head_dim,
|
|
.qk_head_dim = args.qk_nope_head_dim + args.qk_rope_head_dim,
|
|
.v_head_dim = args.v_head_dim,
|
|
.softmax_scale = softmax_scale,
|
|
.use_flash_attention = args.use_flash_attention,
|
|
.wq = wq,
|
|
.wq_a = wq_a,
|
|
.q_norm = q_norm,
|
|
.wq_b = wq_b,
|
|
.wkv_a = wkv_a,
|
|
.kv_norm = kv_norm,
|
|
.wkv_b = wkv_b,
|
|
.wo = wo,
|
|
};
|
|
}
|
|
|
|
pub fn deinit(self: *Self) void {
|
|
if (self.wq) |*w| w.deinit();
|
|
if (self.wq_a) |*w| w.deinit();
|
|
if (self.q_norm) |*n| n.deinit();
|
|
if (self.wq_b) |*w| w.deinit();
|
|
|
|
self.wkv_a.deinit();
|
|
self.kv_norm.deinit();
|
|
self.wkv_b.deinit();
|
|
self.wo.deinit();
|
|
|
|
if (self.kv_cache) |*cache| cache.deinit();
|
|
if (self.rope_cache) |*cache| cache.deinit();
|
|
}
|
|
|
|
// Initialize KV cache for efficient inference
|
|
pub fn initKVCache(self: *Self, batch_size: usize, seq_len: usize) !void {
|
|
if (self.kv_cache != null) return;
|
|
|
|
// Allocate KV cache
|
|
self.kv_cache = try Tensor(f32, 4).init(
|
|
self.allocator,
|
|
.{batch_size, seq_len, self.n_heads, self.head_dim * 2}
|
|
);
|
|
|
|
// Zero-initialize
|
|
@memset(self.kv_cache.?.data, 0);
|
|
|
|
// Allocate rotary positional encoding cache
|
|
self.rope_cache = try Tensor(f32, 3).init(
|
|
self.allocator,
|
|
.{batch_size, seq_len, self.qk_rope_head_dim}
|
|
);
|
|
|
|
@memset(self.rope_cache.?.data, 0);
|
|
}
|
|
|
|
// Forward pass implementation with multiple specialized paths
|
|
pub fn forward(
|
|
self: *Self,
|
|
x: Tensor(f32, 3),
|
|
start_pos: usize,
|
|
freqs_cis: Tensor(f32, 3),
|
|
mask: ?Tensor(f32, 2)
|
|
) !Tensor(f32, 3) {
|
|
const batch_size = x.shape[0];
|
|
const seq_len = x.shape[1];
|
|
const end_pos = start_pos + seq_len;
|
|
|
|
// Initialize KV cache if not already done
|
|
if (start_pos > 0 and self.kv_cache == null) {
|
|
try self.initKVCache(batch_size, args.max_seq_len);
|
|
}
|
|
|
|
// Compute query vectors
|
|
var q: Tensor(f32, 4) = undefined;
|
|
if (self.q_lora_rank == 0) {
|
|
// Standard query projection
|
|
var q_flat = try self.wq.?.forward(x);
|
|
defer q_flat.deinit();
|
|
|
|
// Reshape to [batch, seq_len, heads, head_dim]
|
|
q = try q_flat.reshape(.{batch_size, seq_len, self.n_heads, self.qk_head_dim});
|
|
} else {
|
|
// Low-rank adaptation
|
|
var q_a = try self.wq_a.?.forward(x);
|
|
defer q_a.deinit();
|
|
|
|
var q_norm = try self.q_norm.?.forward(q_a);
|
|
defer q_norm.deinit();
|
|
|
|
var q_b = try self.wq_b.?.forward(q_norm);
|
|
defer q_b.deinit();
|
|
|
|
// Reshape
|
|
q = try q_b.reshape(.{batch_size, seq_len, self.n_heads, self.qk_head_dim});
|
|
}
|
|
defer q.deinit();
|
|
|
|
// Split query into regular and positional parts
|
|
var q_slices = try q.split(3, .{self.qk_nope_head_dim, self.qk_rope_head_dim});
|
|
defer for (q_slices) |*slice| slice.deinit();
|
|
|
|
var q_nope = q_slices[0];
|
|
var q_pe = q_slices[1];
|
|
|
|
// Apply rotary embeddings to position-dependent part
|
|
try applyRotaryEmbeddings(&q_pe, freqs_cis);
|
|
|
|
// Compute key-value vectors
|
|
var kv_raw = try self.wkv_a.forward(x);
|
|
defer kv_raw.deinit();
|
|
|
|
// Split into KV features and positional features
|
|
var kv_slices = try kv_raw.split(2, .{self.kv_lora_rank, self.qk_rope_head_dim});
|
|
defer for (kv_slices) |*slice| slice.deinit();
|
|
|
|
var kv_features = kv_slices[0];
|
|
var k_pe_features = kv_slices[1];
|
|
|
|
// Add batch and heads dimension to positional features
|
|
var k_pe = try k_pe_features.reshape(.{batch_size, seq_len, 1, self.qk_rope_head_dim});
|
|
defer k_pe.deinit();
|
|
|
|
// Apply rotary embeddings
|
|
try applyRotaryEmbeddings(&k_pe, freqs_cis);
|
|
|
|
// Process main KV branch
|
|
var kv_norm_features = try self.kv_norm.forward(kv_features);
|
|
defer kv_norm_features.deinit();
|
|
|
|
var kv_proj = try self.wkv_b.forward(kv_norm_features);
|
|
defer kv_proj.deinit();
|
|
|
|
// Reshape to separate K and V
|
|
var kv_reshaped = try kv_proj.reshape(
|
|
.{batch_size, seq_len, self.n_heads, self.qk_nope_head_dim + self.v_head_dim}
|
|
);
|
|
defer kv_reshaped.deinit();
|
|
|
|
// Split into K and V
|
|
var kv_parts = try kv_reshaped.split(3, .{self.qk_nope_head_dim, self.v_head_dim});
|
|
defer for (kv_parts) |*part| part.deinit();
|
|
|
|
var k_nope = kv_parts[0];
|
|
var v = kv_parts[1];
|
|
|
|
// Combine positional and non-positional key parts
|
|
var k = try combineTensors(k_nope, k_pe, 3);
|
|
defer k.deinit();
|
|
|
|
// Store in KV cache if available
|
|
if (self.kv_cache != null) {
|
|
try self.updateKVCache(k, v, start_pos, end_pos);
|
|
}
|
|
|
|
// Choose attention implementation based on settings
|
|
var attention_output: Tensor(f32, 4) = undefined;
|
|
if (self.use_flash_attention and seq_len > 1) {
|
|
attention_output = try self.computeFlashAttention(
|
|
q_nope,
|
|
q_pe,
|
|
self.kv_cache.?,
|
|
self.rope_cache.?,
|
|
mask,
|
|
batch_size,
|
|
seq_len,
|
|
end_pos
|
|
);
|
|
} else {
|
|
attention_output = try self.computeStandardAttention(
|
|
q,
|
|
k,
|
|
v,
|
|
mask,
|
|
batch_size,
|
|
seq_len,
|
|
end_pos
|
|
);
|
|
}
|
|
defer attention_output.deinit();
|
|
|
|
// Final projection
|
|
var attention_flat = try attention_output.reshape(
|
|
.{batch_size, seq_len, self.n_heads * self.v_head_dim}
|
|
);
|
|
defer attention_flat.deinit();
|
|
|
|
return self.wo.forward(attention_flat);
|
|
}
|
|
|
|
// Flash attention implementation optimized for large contexts
|
|
fn computeFlashAttention(
|
|
self: *const Self,
|
|
q_nope: Tensor(f32, 4),
|
|
q_pe: Tensor(f32, 4),
|
|
kv_cache: Tensor(f32, 4),
|
|
rope_cache: Tensor(f32, 3),
|
|
mask: ?Tensor(f32, 2),
|
|
batch_size: usize,
|
|
seq_len: usize,
|
|
end_pos: usize
|
|
) !Tensor(f32, 4) {
|
|
// Flash attention implementation with tiling to maximize cache efficiency
|
|
// This function would include a highly optimized SIMD implementation
|
|
// specializing in memory-efficient attention computation
|
|
|
|
// Note: This would be a substantial implementation with memory-efficient
|
|
// blocked matrix multiplication and careful SIMD optimization
|
|
// We're providing a simplified structure here
|
|
|
|
// For a full implementation, see the FlashAttention algorithm paper
|
|
const block_size = 32; // Block size tuned for L1 cache
|
|
|
|
// Output tensor
|
|
var output = try Tensor(f32, 4).init(
|
|
self.allocator,
|
|
.{batch_size, seq_len, self.n_heads, self.v_head_dim}
|
|
);
|
|
|
|
// Implement blocked attention algorithm...
|
|
// This would contain optimized SIMD code for tiled attention computation
|
|
|
|
return output;
|
|
}
|
|
|
|
// Standard attention for shorter sequences or when flash attention is disabled
|
|
fn computeStandardAttention(
|
|
self: *const Self,
|
|
q: Tensor(f32, 4),
|
|
k: Tensor(f32, 4),
|
|
v: Tensor(f32, 4),
|
|
mask: ?Tensor(f32, 2),
|
|
batch_size: usize,
|
|
seq_len: usize,
|
|
end_pos: usize
|
|
) !Tensor(f32, 4) {
|
|
// Compute QK attention scores
|
|
var scores = try computeAttentionScores(q, k, self.softmax_scale);
|
|
defer scores.deinit();
|
|
|
|
// Apply causal mask if provided
|
|
if (mask) |m| {
|
|
try applyAttentionMask(&scores, m);
|
|
}
|
|
|
|
// Apply softmax
|
|
try applySoftmax(&scores, -1);
|
|
|
|
// Compute attention output (scores @ v)
|
|
return computeAttentionOutput(scores, v);
|
|
}
|
|
|
|
// Update KV cache with new values
|
|
fn updateKVCache(
|
|
self: *Self,
|
|
k: Tensor(f32, 4),
|
|
v: Tensor(f32, 4),
|
|
start_pos: usize,
|
|
end_pos: usize
|
|
) !void {
|
|
const batch_size = k.shape[0];
|
|
const seq_len = k.shape[1];
|
|
|
|
// Update key cache
|
|
for (0..batch_size) |b| {
|
|
for (0..seq_len) |s| {
|
|
const cache_pos = start_pos + s;
|
|
for (0..self.n_heads) |h| {
|
|
// Copy K values
|
|
for (0..self.qk_head_dim) |d| {
|
|
const k_val = try k.at(.{b, s, h, d});
|
|
try self.kv_cache.?.set(.{b, cache_pos, h, d}, k_val);
|
|
}
|
|
|
|
// Copy V values
|
|
for (0..self.v_head_dim) |d| {
|
|
const v_val = try v.at(.{b, s, h, d});
|
|
try self.kv_cache.?.set(.{b, cache_pos, h, self.qk_head_dim + d}, v_val);
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
};
|
|
}
|
|
```

**Key Optimizations:**

- **Compile-Time Specialization**: Generated attention routines are tailored to model dimensions at compile time
- **Flash Attention Algorithm**: Memory-efficient attention computation for long sequences
- **SIMD-Optimized Matrix Operations**: Vectorized attention-score calculation and softmax (a standalone softmax sketch follows this list)
- **Optimized KV-Cache Layout**: Cache-friendly memory layout for efficient sequence generation
- **Sparse Attention Patterns**: Support for attention patterns beyond standard causal attention
- **Memory Reuse**: Careful tensor management to minimize allocations during inference
- **Specialized Attention Paths**: Different implementations optimized for inference vs. training
- **Low-Rank Adaptation**: LoRA support for more efficient fine-tuning
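
To illustrate the vectorized softmax mentioned in the list, here is a standalone, hedged sketch (not the exact routine the attention block would ship with); it assumes the row length is a multiple of the vector width:

```zig
const std = @import("std");

// Numerically stable softmax over one attention-score row, using @Vector for
// the max/sum reductions. A production version would handle the tail lanes.
pub fn softmaxRow(row: []f32) void {
    const V = @Vector(8, f32);
    std.debug.assert(row.len % 8 == 0);

    // 1. Find the row maximum.
    var max_vec: V = @splat(-std.math.inf(f32));
    var i: usize = 0;
    while (i < row.len) : (i += 8) {
        const v: V = row[i..][0..8].*;
        max_vec = @max(max_vec, v);
    }
    const row_max = @reduce(.Max, max_vec);

    // 2. Exponentiate (shifted by the max) and accumulate the sum.
    var sum: f32 = 0;
    i = 0;
    while (i < row.len) : (i += 8) {
        const v: V = row[i..][0..8].*;
        const e = @exp(v - @as(V, @splat(row_max)));
        row[i..][0..8].* = e;
        sum += @reduce(.Add, e);
    }

    // 3. Normalize.
    const inv: V = @splat(1.0 / sum);
    i = 0;
    while (i < row.len) : (i += 8) {
        const v: V = row[i..][0..8].*;
        row[i..][0..8].* = v * inv;
    }
}
```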

#### 2.3 Mixture of Experts (MoE)

The Mixture of Experts (MoE) architecture is a key innovation in DeepSeek V3: it scales model capacity without a proportional increase in computation cost. Our Zig implementation leverages compile-time specialization and parallel execution for maximum efficiency. A simplified routing sketch comes first, followed by the full component:
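The helper below is an illustration of the routing math only (softmax scoring followed by greedy top-k selection), not the `RouterGate.selectTopkExperts` implementation:

```zig
const std = @import("std");

// Pick the top-k experts for one token from its routing scores.
// Fills the caller-provided index and weight buffers (weights sum to 1).
pub fn topKRoute(scores: []const f32, indices: []usize, weights: []f32) void {
    const k = indices.len;
    std.debug.assert(weights.len == k and k <= scores.len);

    // Softmax shift for numerical stability.
    var max_score: f32 = -std.math.inf(f32);
    for (scores) |s| max_score = @max(max_score, s);

    // Greedy top-k on the softmax numerators; the shared denominator does
    // not change the ordering. Assumes at most 64 routed experts.
    var taken = [_]bool{false} ** 64;
    std.debug.assert(scores.len <= taken.len);

    var weight_sum: f32 = 0;
    for (0..k) |slot| {
        var best_idx: usize = 0;
        var best_val: f32 = -std.math.inf(f32);
        for (scores, 0..) |s, idx| {
            if (!taken[idx] and s > best_val) {
                best_val = s;
                best_idx = idx;
            }
        }
        taken[best_idx] = true;
        indices[slot] = best_idx;
        weights[slot] = @exp(best_val - max_score);
        weight_sum += weights[slot];
    }

    // Renormalize so the selected experts' weights sum to 1.
    for (weights) |*w| w.* /= weight_sum;
}
```
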
```zig
|
|
// Generic MoE implementation with compile-time specialization
|
|
pub fn MixtureOfExperts(comptime args: ModelArgs) type {
|
|
return struct {
|
|
const Self = @This();
|
|
const ModelType = args.getModelType();
|
|
|
|
// Configuration
|
|
allocator: std.mem.Allocator,
|
|
dim: usize,
|
|
n_routed_experts: usize,
|
|
n_local_experts: usize,
|
|
n_activated_experts: usize,
|
|
experts_start_idx: usize,
|
|
experts_end_idx: usize,
|
|
use_parallel_execution: bool,
|
|
|
|
// Components
|
|
gate: RouterGate(args),
|
|
experts: []Expert(args),
|
|
shared_experts: MLP(args),
|
|
thread_pool: ?*ComputeThreadPool = null,
|
|
|
|
// Initialize MoE with appropriate configuration
|
|
pub fn init(allocator: std.mem.Allocator) !Self {
|
|
// Determine expert distribution across processes
|
|
const world_size = 1; // Set to actual world size for distributed training
|
|
const rank = 0; // Set to actual rank for distributed training
|
|
|
|
            // Number of experts must be divisible by world size
            std.debug.assert(args.n_routed_experts % world_size == 0);
|
|
|
|
const n_local_experts = args.n_routed_experts / world_size;
|
|
const experts_start_idx = rank * n_local_experts;
|
|
const experts_end_idx = experts_start_idx + n_local_experts;
|
|
|
|
// Initialize routing gate
|
|
var gate = try RouterGate(args).init(allocator);
|
|
errdefer gate.deinit();
|
|
|
|
// Initialize experts
|
|
var experts = try allocator.alloc(Expert(args), args.n_routed_experts);
|
|
errdefer allocator.free(experts);
|
|
|
|
// Only initialize experts that belong to this process
|
|
for (experts, 0..) |*expert, i| {
|
|
if (experts_start_idx <= i and i < experts_end_idx) {
|
|
expert.* = try Expert(args).init(allocator);
|
|
} else {
|
|
expert.* = undefined; // Not used on this process
|
|
}
|
|
}
|
|
|
|
// Initialize shared experts (always executed)
|
|
var shared_experts = try MLP(args).init(
|
|
allocator,
|
|
args.dim,
|
|
args.n_shared_experts * args.moe_inter_dim
|
|
);
|
|
errdefer shared_experts.deinit();
|
|
|
|
// Initialize thread pool for parallel execution if needed
|
|
var thread_pool: ?*ComputeThreadPool = null;
|
|
if (args.use_parallel_experts) {
|
|
thread_pool = try allocator.create(ComputeThreadPool);
|
|
const cpu_count = try std.Thread.getCpuCount();
|
|
                const optimal_threads = @min(cpu_count, args.n_activated_experts + args.n_shared_experts);
|
|
thread_pool.?.* = try ComputeThreadPool.init(optimal_threads);
|
|
}
|
|
|
|
return Self{
|
|
.allocator = allocator,
|
|
.dim = args.dim,
|
|
.n_routed_experts = args.n_routed_experts,
|
|
.n_local_experts = n_local_experts,
|
|
.n_activated_experts = args.n_activated_experts,
|
|
.experts_start_idx = experts_start_idx,
|
|
.experts_end_idx = experts_end_idx,
|
|
.use_parallel_execution = args.use_parallel_experts,
|
|
.gate = gate,
|
|
.experts = experts,
|
|
.shared_experts = shared_experts,
|
|
.thread_pool = thread_pool,
|
|
};
|
|
}
|
|
|
|
pub fn deinit(self: *Self) void {
|
|
self.gate.deinit();
|
|
|
|
// Only deinit experts that belong to this process
|
|
for (self.experts, 0..) |*expert, i| {
|
|
if (self.experts_start_idx <= i and i < self.experts_end_idx) {
|
|
expert.deinit();
|
|
}
|
|
}
|
|
self.allocator.free(self.experts);
|
|
|
|
self.shared_experts.deinit();
|
|
|
|
if (self.thread_pool) |pool| {
|
|
pool.deinit();
|
|
self.allocator.destroy(pool);
|
|
}
|
|
}
|
|
|
|
// Forward pass implementation with parallel expert execution
|
|
pub fn forward(self: *Self, x: Tensor(f32, 3)) !Tensor(f32, 3) {
|
|
const batch_size = x.shape[0];
|
|
const seq_len = x.shape[1];
|
|
|
|
// Reshape input for routing
|
|
var x_flat = try x.reshape(.{batch_size * seq_len, self.dim});
|
|
defer x_flat.deinit();
|
|
|
|
// Router computation
|
|
var router_output = try self.gate.forward(x_flat);
|
|
defer {
|
|
router_output.weights.deinit();
|
|
router_output.indices.deinit();
|
|
}
|
|
|
|
// Get routing weights and indices
|
|
const weights = router_output.weights;
|
|
const indices = router_output.indices;
|
|
|
|
// Initialize result tensor with zeros
|
|
var result = try Tensor(f32, 2).init(
|
|
self.allocator,
|
|
.{batch_size * seq_len, self.dim}
|
|
);
|
|
errdefer result.deinit();
|
|
|
|
@memset(result.data, 0);
|
|
|
|
// Count expert assignments for load balancing analysis
|
|
var expert_counts = try self.allocator.alloc(usize, self.n_routed_experts);
|
|
defer self.allocator.free(expert_counts);
|
|
@memset(expert_counts, 0);
|
|
|
|
for (indices.data) |idx| {
|
|
expert_counts[idx] += 1;
|
|
}
|
|
|
|
// Process each expert
|
|
if (self.use_parallel_execution and self.thread_pool != null) {
|
|
try self.parallelExpertExecution(
|
|
x_flat,
|
|
weights,
|
|
indices,
|
|
expert_counts,
|
|
&result
|
|
);
|
|
} else {
|
|
try self.sequentialExpertExecution(
|
|
x_flat,
|
|
weights,
|
|
indices,
|
|
expert_counts,
|
|
&result
|
|
);
|
|
}
|
|
|
|
// Always execute shared experts
|
|
var shared_output = try self.shared_experts.forward(x_flat);
|
|
defer shared_output.deinit();
|
|
|
|
// Add shared expert output to result
|
|
try addTensors(&result, shared_output);
|
|
|
|
// Reshape back to original dimensions
|
|
return result.reshape(.{batch_size, seq_len, self.dim});
|
|
}
|
|
|
|
// Parallel execution of experts using thread pool
|
|
fn parallelExpertExecution(
|
|
self: *Self,
|
|
x: Tensor(f32, 2),
|
|
weights: Tensor(f32, 2),
|
|
indices: Tensor(usize, 2),
|
|
expert_counts: []usize,
|
|
result: *Tensor(f32, 2)
|
|
) !void {
|
|
const thread_pool = self.thread_pool.?;
|
|
var work_queue = std.ArrayList(ExpertWorkItem).init(self.allocator);
|
|
defer work_queue.deinit();
|
|
|
|
// Create work items for each expert
|
|
for (0..self.n_routed_experts) |expert_idx| {
|
|
if (expert_counts[expert_idx] == 0) continue;
|
|
|
|
if (expert_idx < self.experts_start_idx or expert_idx >= self.experts_end_idx) {
|
|
// Skip experts not assigned to this process
|
|
continue;
|
|
}
|
|
|
|
// Extract tokens routed to this expert
|
|
var token_indices = try self.allocator.alloc(usize, expert_counts[expert_idx]);
|
|
var token_weights = try self.allocator.alloc(f32, expert_counts[expert_idx]);
|
|
|
|
var token_count: usize = 0;
|
|
for (0..x.shape[0]) |i| {
|
|
for (0..self.n_activated_experts) |j| {
|
|
const index_offset = i * self.n_activated_experts + j;
|
|
if (indices.data[index_offset] == expert_idx) {
|
|
token_indices[token_count] = i;
|
|
token_weights[token_count] = weights.data[index_offset];
|
|
token_count += 1;
|
|
}
|
|
}
|
|
}
|
|
|
|
// Create work item
|
|
try work_queue.append(.{
|
|
.allocator = self.allocator,
|
|
.expert = &self.experts[expert_idx],
|
|
.x = x,
|
|
.token_indices = token_indices,
|
|
.token_weights = token_weights,
|
|
.result = result,
|
|
.thread_pool = thread_pool,
|
|
});
|
|
}
|
|
|
|
// Schedule parallel expert execution
|
|
for (work_queue.items) |*work_item| {
|
|
// Increment completion counter
|
|
_ = thread_pool.completion_count.fetchAdd(1, .Release);
|
|
|
|
// Submit task to thread pool
|
|
try thread_pool.compute(processExpertWork, work_item);
|
|
}
|
|
|
|
// Wait for all expert computations to complete
|
|
thread_pool.waitAll();
|
|
}
|
|
|
|
// Sequential execution of experts
|
|
fn sequentialExpertExecution(
|
|
self: *Self,
|
|
x: Tensor(f32, 2),
|
|
weights: Tensor(f32, 2),
|
|
indices: Tensor(usize, 2),
|
|
expert_counts: []usize,
|
|
result: *Tensor(f32, 2)
|
|
) !void {
|
|
// Process each expert sequentially
|
|
for (0..self.n_routed_experts) |expert_idx| {
|
|
if (expert_counts[expert_idx] == 0) continue;
|
|
|
|
if (expert_idx < self.experts_start_idx or expert_idx >= self.experts_end_idx) {
|
|
// Skip experts not assigned to this process
|
|
continue;
|
|
}
|
|
|
|
// Get tokens assigned to this expert
|
|
for (0..x.shape[0]) |i| {
|
|
for (0..self.n_activated_experts) |j| {
|
|
const index_offset = i * self.n_activated_experts + j;
|
|
if (indices.data[index_offset] == expert_idx) {
|
|
// Process token with this expert
|
|
const token_weight = weights.data[index_offset];
|
|
|
|
// Extract input token
|
|
var token_input = try x.slice(.{i, 0}, .{i + 1, self.dim});
|
|
defer token_input.deinit();
|
|
|
|
// Process through expert
|
|
var expert_output = try self.experts[expert_idx].forward(token_input);
|
|
defer expert_output.deinit();
|
|
|
|
// Scale by routing weight
|
|
try scaleTensor(&expert_output, token_weight);
|
|
|
|
// Add to result
|
|
for (0..self.dim) |d| {
|
|
result.data[i * self.dim + d] += expert_output.data[d];
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
// Worker task for parallel expert execution
|
|
const ExpertWorkItem = struct {
|
|
allocator: std.mem.Allocator,
|
|
expert: *Expert(args),
|
|
x: Tensor(f32, 2),
|
|
token_indices: []usize,
|
|
token_weights: []f32,
|
|
result: *Tensor(f32, 2),
|
|
thread_pool: *ComputeThreadPool,
|
|
};
|
|
|
|
fn processExpertWork(ctx_ptr: *anyopaque) void {
|
|
const ctx = @ptrCast(*ExpertWorkItem, @alignCast(@alignOf(ExpertWorkItem), ctx_ptr));
|
|
defer {
|
|
ctx.allocator.free(ctx.token_indices);
|
|
ctx.allocator.free(ctx.token_weights);
|
|
_ = ctx.thread_pool.completion_count.fetchSub(1, .Release);
|
|
}
|
|
|
|
// Process each token assigned to this expert
|
|
for (ctx.token_indices, ctx.token_weights, 0..) |token_idx, weight, i| {
|
|
// Extract input token
|
|
var token_input = ctx.x.slice(.{token_idx, 0}, .{token_idx + 1, ctx.x.shape[1]}) catch return;
|
|
defer token_input.deinit();
|
|
|
|
// Process through expert
|
|
var expert_output = ctx.expert.forward(token_input) catch return;
|
|
defer expert_output.deinit();
|
|
|
|
// Scale by routing weight
|
|
scaleTensor(&expert_output, weight) catch return;
|
|
|
|
                // Add to result; @atomicRmw makes the read-modify-write atomic,
                // so experts on different threads cannot lose each other's updates
                for (0..expert_output.shape[1]) |d| {
                    const offset = token_idx * expert_output.shape[1] + d;
                    _ = @atomicRmw(f32, &ctx.result.data[offset], .Add, expert_output.data[d], .AcqRel);
                }
|
|
}
|
|
}
|
|
};
|
|
}
|
|
|
|
// Router gate for MoE that determines which experts to use for each token
|
|
pub fn RouterGate(comptime args: ModelArgs) type {
|
|
return struct {
|
|
const Self = @This();
|
|
|
|
allocator: std.mem.Allocator,
|
|
dim: usize,
|
|
n_experts: usize,
|
|
n_groups: usize,
|
|
n_limited_groups: usize,
|
|
topk: usize,
|
|
score_func: enum { softmax, sigmoid },
|
|
route_scale: f32,
|
|
|
|
// Router weights
|
|
weight: Tensor(f32, 2),
|
|
bias: ?Tensor(f32, 1) = null,
|
|
|
|
pub fn init(allocator: std.mem.Allocator) !Self {
|
|
var weight = try Tensor(f32, 2).init(
|
|
allocator,
|
|
.{args.n_routed_experts, args.dim}
|
|
);
|
|
|
|
// Initialize with appropriate distribution
|
|
try initializeParameters(&weight, 0.0, 0.02);
|
|
|
|
// Create optional bias
|
|
var bias: ?Tensor(f32, 1) = null;
|
|
if (args.dim == 7168) { // Special case for bias
|
|
bias = try Tensor(f32, 1).init(allocator, .{args.n_routed_experts});
|
|
@memset(bias.?.data, 0);
|
|
}
|
|
|
|
return Self{
|
|
.allocator = allocator,
|
|
.dim = args.dim,
|
|
.n_experts = args.n_routed_experts,
|
|
.n_groups = args.n_expert_groups,
|
|
.n_limited_groups = args.n_limited_groups,
|
|
.topk = args.n_activated_experts,
|
|
.score_func = args.score_func,
|
|
.route_scale = args.route_scale,
|
|
.weight = weight,
|
|
.bias = bias,
|
|
};
|
|
}
|
|
|
|
pub fn deinit(self: *Self) void {
|
|
self.weight.deinit();
|
|
if (self.bias) |*b| b.deinit();
|
|
}
|
|
|
|
// Router forward pass to determine expert assignment
|
|
pub fn forward(self: *const Self, x: Tensor(f32, 2)) !RouterOutput {
|
|
// Compute routing scores
|
|
var scores = try linearProjection(x, self.weight, self.bias);
|
|
defer scores.deinit();
|
|
|
|
// Apply scoring function
|
|
var routing_probs: Tensor(f32, 2) = undefined;
|
|
if (self.score_func == .softmax) {
|
|
routing_probs = try applySoftmax(scores, 1);
|
|
} else {
|
|
routing_probs = try applySigmoid(scores);
|
|
}
|
|
defer routing_probs.deinit();
|
|
|
|
// Save original scores for later
|
|
var original_scores = try routing_probs.clone();
|
|
|
|
// Expert group handling
|
|
if (self.n_groups > 1) {
|
|
try self.applyGroupFiltering(&routing_probs);
|
|
}
|
|
|
|
// Select top-k experts
|
|
var indices = try Tensor(usize, 2).init(
|
|
self.allocator,
|
|
.{x.shape[0], self.topk}
|
|
);
|
|
|
|
var weights = try Tensor(f32, 2).init(
|
|
self.allocator,
|
|
.{x.shape[0], self.topk}
|
|
);
|
|
|
|
try self.selectTopkExperts(routing_probs, original_scores, &indices, &weights);
|
|
|
|
// Apply routing scale
|
|
if (self.route_scale != 1.0) {
|
|
try scaleTensor(&weights, self.route_scale);
|
|
}
|
|
|
|
return RouterOutput{
|
|
.weights = weights,
|
|
.indices = indices,
|
|
};
|
|
}
|
|
|
|
// Apply expert group filtering
|
|
fn applyGroupFiltering(self: *const Self, scores: *Tensor(f32, 2)) !void {
|
|
// Reshape scores for group processing
|
|
const batch_size = scores.shape[0];
|
|
const experts_per_group = self.n_experts / self.n_groups;
|
|
|
|
var reshaped_scores = try scores.reshape(
|
|
.{batch_size, self.n_groups, experts_per_group}
|
|
);
|
|
defer reshaped_scores.deinit();
|
|
|
|
// Compute group scores
|
|
var group_scores = try Tensor(f32, 2).init(
|
|
self.allocator,
|
|
.{batch_size, self.n_groups}
|
|
);
|
|
defer group_scores.deinit();
|
|
|
|
// Calculate score for each group
|
|
if (self.bias == null) {
|
|
// Use max score as group score
|
|
for (0..batch_size) |b| {
|
|
for (0..self.n_groups) |g| {
|
|
var max_score: f32 = -std.math.inf_f32;
|
|
for (0..experts_per_group) |e| {
|
|
const score = try reshaped_scores.at(.{b, g, e});
|
|
if (score > max_score) max_score = score;
|
|
}
|
|
try group_scores.set(.{b, g}, max_score);
|
|
}
|
|
}
|
|
} else {
|
|
// Use sum of top-2 scores as group score
|
|
for (0..batch_size) |b| {
|
|
for (0..self.n_groups) |g| {
|
|
var scores_arr = try self.allocator.alloc(f32, experts_per_group);
|
|
defer self.allocator.free(scores_arr);
|
|
|
|
// Extract scores for this group
|
|
for (0..experts_per_group) |e| {
|
|
scores_arr[e] = try reshaped_scores.at(.{b, g, e});
|
|
}
|
|
|
|
// Sort to find top-2
|
|
std.sort.sort(f32, scores_arr, {}, std.sort.desc(f32));
|
|
|
|
// Sum top-2 scores
|
|
const group_score = scores_arr[0] + scores_arr[1];
|
|
try group_scores.set(.{b, g}, group_score);
|
|
}
|
|
}
|
|
}
|
|
|
|
// Find top-k groups
|
|
var top_groups = try Tensor(usize, 2).init(
|
|
self.allocator,
|
|
.{batch_size, self.n_limited_groups}
|
|
);
|
|
defer top_groups.deinit();
|
|
|
|
// Select top-k groups
|
|
for (0..batch_size) |b| {
|
|
var scores_arr = try self.allocator.alloc(struct { score: f32, idx: usize }, self.n_groups);
|
|
defer self.allocator.free(scores_arr);
|
|
|
|
// Prepare for sorting
|
|
for (0..self.n_groups) |g| {
|
|
scores_arr[g] = .{
|
|
.score = try group_scores.at(.{b, g}),
|
|
.idx = g,
|
|
};
|
|
}
|
|
|
|
// Sort by score
|
|
const Sort = struct {
|
|
fn desc(context: void, a: anytype, b: anytype) bool {
|
|
return a.score > b.score;
|
|
}
|
|
};
|
|
std.sort.sort(struct { score: f32, idx: usize }, scores_arr, {}, Sort.desc);
|
|
|
|
// Store top-k group indices
|
|
for (0..self.n_limited_groups) |i| {
|
|
try top_groups.set(.{b, i}, scores_arr[i].idx);
|
|
}
|
|
}
|
|
|
|
// Create mask for filtering
|
|
var mask = try Tensor(bool, 3).init(
|
|
self.allocator,
|
|
.{batch_size, self.n_groups, 1}
|
|
);
|
|
defer mask.deinit();
|
|
|
|
// Initialize all groups as masked (excluded)
|
|
@memset(mask.data, true);
|
|
|
|
// Unmask top groups
|
|
for (0..batch_size) |b| {
|
|
for (0..self.n_limited_groups) |i| {
|
|
const g = try top_groups.at(.{b, i});
|
|
try mask.set(.{b, g, 0}, false);
|
|
}
|
|
}
|
|
|
|
// Apply mask
|
|
for (0..batch_size) |b| {
|
|
for (0..self.n_groups) |g| {
|
|
const is_masked = try mask.at(.{b, g, 0});
|
|
if (is_masked) {
|
|
// Mask out this group by setting scores to -inf
|
|
for (0..experts_per_group) |e| {
|
|
try reshaped_scores.set(.{b, g, e}, -std.math.inf_f32);
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
// Reshape back to original shape
|
|
try scores.copyFrom(reshaped_scores.reshape(.{batch_size, self.n_experts}) catch unreachable);
|
|
}
|
|
|
|
// Select top-k experts based on routing scores
|
|
fn selectTopkExperts(
|
|
self: *const Self,
|
|
scores: Tensor(f32, 2),
|
|
original_scores: Tensor(f32, 2),
|
|
indices: *Tensor(usize, 2),
|
|
weights: *Tensor(f32, 2)
|
|
) !void {
|
|
const batch_size = scores.shape[0];
|
|
|
|
for (0..batch_size) |b| {
|
|
var scores_arr = try self.allocator.alloc(struct { score: f32, idx: usize }, self.n_experts);
|
|
defer self.allocator.free(scores_arr);
|
|
|
|
// Prepare for sorting
|
|
for (0..self.n_experts) |e| {
|
|
scores_arr[e] = .{
|
|
.score = try scores.at(.{b, e}),
|
|
.idx = e,
|
|
};
|
|
}
|
|
|
|
// Sort by score
|
|
const Sort = struct {
|
|
fn desc(context: void, a: anytype, b: anytype) bool {
|
|
return a.score > b.score;
|
|
}
|
|
};
|
|
std.sort.sort(struct { score: f32, idx: usize }, scores_arr, {}, Sort.desc);
|
|
|
|
// Store top-k indices and get weights from original scores
|
|
for (0..self.topk) |i| {
|
|
const expert_idx = scores_arr[i].idx;
|
|
try indices.set(.{b, i}, expert_idx);
|
|
|
|
// Get weight from original scores
|
|
const weight = try original_scores.at(.{b, expert_idx});
|
|
try weights.set(.{b, i}, weight);
|
|
}
|
|
|
|
// Normalize weights for sigmoid scoring
|
|
if (self.score_func == .sigmoid) {
|
|
var sum: f32 = 0.0;
|
|
for (0..self.topk) |i| {
|
|
sum += try weights.at(.{b, i});
|
|
}
|
|
|
|
if (sum > 0.0) {
|
|
for (0..self.topk) |i| {
|
|
const w = try weights.at(.{b, i});
|
|
try weights.set(.{b, i}, w / sum);
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
};
|
|
}
|
|
|
|
// Output from router gate
|
|
pub const RouterOutput = struct {
|
|
weights: Tensor(f32, 2), // [batch_size, topk]
|
|
indices: Tensor(usize, 2), // [batch_size, topk]
|
|
};
|
|
```
|
|
|
|
**Key Features:**
|
|
- **Compile-Time Specialization**: Generated MoE implementation tailored to model dimensions and configuration
|
|
- **Parallel Expert Execution**: Efficient multi-threading with work distribution and load balancing
|
|
- **Atomic Operations**: Thread-safe updates to shared tensors
|
|
- **Group-Based Routing**: Optimized implementation of expert groups for more efficient routing
|
|
- **Memory-Efficient Tensor Management**: Careful handling of temporary allocations
|
|
- **Flexible Scoring Functions**: Support for both softmax and sigmoid routing
|
|
- **Expert Load Balancing**: Runtime tracking of expert utilization
|
|
- **Distributed Expert Sharding**: Support for distributing experts across multiple processes
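
To make the routing arithmetic concrete, here is a small, self-contained sketch (illustrative logits and expert count only, independent of the `RouterGate` type above) that applies softmax scoring to one token's expert logits and then picks the top-k experts, mirroring what `selectTopkExperts` does per row:

```zig
const std = @import("std");

pub fn main() void {
    const logits = [_]f32{ 0.2, 1.5, -0.3, 0.9 }; // one token, 4 experts
    const topk = 2;

    // Softmax over the expert logits
    var max: f32 = logits[0];
    for (logits) |l| max = @max(max, l);

    var probs: [logits.len]f32 = undefined;
    var sum: f32 = 0.0;
    for (logits, 0..) |l, i| {
        probs[i] = @exp(l - max);
        sum += probs[i];
    }
    for (&probs) |*p| p.* /= sum;

    // Top-k selection by repeated argmax (fine for a handful of experts)
    var selected: [topk]usize = undefined;
    var weights: [topk]f32 = undefined;
    var taken = [_]bool{false} ** logits.len;
    for (0..topk) |k| {
        var best: usize = 0;
        var best_p: f32 = -1.0;
        for (probs, 0..) |p, i| {
            if (!taken[i] and p > best_p) {
                best_p = p;
                best = i;
            }
        }
        taken[best] = true;
        selected[k] = best;
        weights[k] = best_p;
    }

    std.debug.print("experts={any} weights={any}\n", .{ selected, weights });
}
```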
|
|
|
|
### 3. Computation Backend
|
|
|
|
This section outlines the computation backend architecture for the DeepSeek-V3 project as implemented in Zig. The design emphasizes performance, modularity, and hardware portability.
|
|
|
|
#### 3.1 Backend Interface
|
|
|
|
The backend interface provides a unified abstraction layer for all computation targets while maintaining Zig's zero-cost abstraction philosophy.
|
|
|
|
```zig
|
|
pub const ComputeBackend = struct {
|
|
const Self = @This();
|
|
|
|
// Function pointers for backend-specific operations with proper type safety
|
|
matmulFn: *const fn(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) anyerror!void,
|
|
softmaxFn: *const fn(x: anytype, dim: usize, allocator: std.mem.Allocator) anyerror!void,
|
|
rmsnormFn: *const fn(x: anytype, weight: anytype, eps: f32, allocator: std.mem.Allocator) anyerror!void,
|
|
attentionFn: *const fn(q: anytype, k: anytype, v: anytype, mask: ?anytype, scale: f32, allocator: std.mem.Allocator) anyerror!void,
|
|
// Other operations...
|
|
|
|
// Configuration for the backend
|
|
config: BackendConfig,
|
|
|
|
// Dispatch based on backend type with proper error handling
|
|
pub fn matmul(self: *const Self, a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void {
|
|
return self.matmulFn(a, b, c, allocator);
|
|
}
|
|
|
|
pub fn softmax(self: *const Self, x: anytype, dim: usize, allocator: std.mem.Allocator) !void {
|
|
return self.softmaxFn(x, dim, allocator);
|
|
}
|
|
|
|
pub fn rmsnorm(self: *const Self, x: anytype, weight: anytype, eps: f32, allocator: std.mem.Allocator) !void {
|
|
return self.rmsnormFn(x, weight, eps, allocator);
|
|
}
|
|
|
|
pub fn attention(self: *const Self, q: anytype, k: anytype, v: anytype, mask: ?anytype, scale: f32, allocator: std.mem.Allocator) !void {
|
|
return self.attentionFn(q, k, v, mask, scale, allocator);
|
|
}
|
|
};
|
|
```
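
A hypothetical call site would select a backend once at startup and then route every operation through this interface. The sketch below assumes the `CPUBackend`/`MetalBackend` constructors shown in the next subsection and the `Tensor` type used throughout this proposal; it is illustrative, not an existing API:

```zig
const std = @import("std");
const builtin = @import("builtin");

// Pick a backend for the current platform, falling back to the CPU path.
fn selectBackend(allocator: std.mem.Allocator) !ComputeBackend {
    if (comptime builtin.os.tag == .macos) {
        return MetalBackend.init(allocator) catch try CPUBackend.init(allocator, null);
    }
    return CPUBackend.init(allocator, null);
}

// Callers only ever see the abstract interface, never the concrete backend.
fn projectAndNormalize(
    backend: *const ComputeBackend,
    x: Tensor(f32, 2),
    w: Tensor(f32, 2),
    out: *Tensor(f32, 2),
    allocator: std.mem.Allocator,
) !void {
    try backend.matmul(x, w, out, allocator);
    try backend.softmax(out, 1, allocator);
}
```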
|
|
|
|
#### 3.2 Platform-Specific Implementations
|
|
|
|
```zig
|
|
pub const CPUBackend = struct {
|
|
allocator: std.mem.Allocator,
|
|
thread_pool: ?*ThreadPool,
|
|
|
|
pub fn init(allocator: std.mem.Allocator, thread_count: ?usize) !ComputeBackend {
|
|
const thread_pool = if (thread_count) |count| {
|
|
try ThreadPool.init(allocator, .{ .thread_count = count });
|
|
} else null;
|
|
|
|
return ComputeBackend{
|
|
.matmulFn = cpuMatmul,
|
|
.softmaxFn = cpuSoftmax,
|
|
.rmsnormFn = cpuRmsnorm,
|
|
.attentionFn = cpuAttention,
|
|
// Other operations...
|
|
.config = BackendConfig{
|
|
.backend_type = .Cpu,
|
|
.max_threads = thread_count,
|
|
// Other CPU-specific config...
|
|
},
|
|
};
|
|
}
|
|
|
|
fn cpuMatmul(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void {
|
|
// Dynamically select the optimal implementation based on matrix dimensions and CPU features
|
|
if (c.shape[0] * c.shape[1] > 1024 * 1024 and detectCpuFeatures().use_avx2) {
|
|
return cpuMatmulParallel(a, b, c, allocator);
|
|
}
|
|
return cpuMatmulSIMD(a, b, c, allocator);
|
|
}
|
|
|
|
fn cpuSoftmax(x: anytype, dim: usize, allocator: std.mem.Allocator) !void {
|
|
// Optimized CPU implementation using SIMD
|
|
// Implementation details...
|
|
}
|
|
|
|
// Other CPU-specific implementations...
|
|
};
|
|
|
|
pub const MetalBackend = struct {
|
|
device: *MTLDevice,
|
|
command_queue: *MTLCommandQueue,
|
|
library: *MTLLibrary,
|
|
allocator: std.mem.Allocator,
|
|
pipelines: PipelineCache,
|
|
|
|
pub fn init(allocator: std.mem.Allocator) !ComputeBackend {
|
|
// Initialize Metal device, command queue, and library
|
|
const device = MTLCreateSystemDefaultDevice() orelse return error.MetalDeviceNotAvailable;
|
|
const command_queue = device.newCommandQueue() orelse return error.CommandQueueCreationFailed;
|
|
|
|
// Load compute shaders from embedded metal code or compiled library
|
|
const library = try loadDefaultLibrary(device);
|
|
|
|
// Initialize pipeline cache
|
|
var pipelines = PipelineCache.init(allocator);
|
|
try pipelines.precompileEssentialPipelines(device, library);
|
|
|
|
return ComputeBackend{
|
|
.matmulFn = metalMatmul,
|
|
.softmaxFn = metalSoftmax,
|
|
.rmsnormFn = metalRmsnorm,
|
|
.attentionFn = metalAttention,
|
|
// Other operations...
|
|
.config = BackendConfig{
|
|
.backend_type = .Metal,
|
|
.workgroup_size = .{16, 16, 1},
|
|
.shared_memory_size = 32 * 1024,
|
|
// Other Metal-specific config...
|
|
},
|
|
};
|
|
}
|
|
|
|
fn metalMatmul(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void {
|
|
// Implementation using Metal Performance Shaders when available
|
|
// Fallback to custom compute kernel for specialized operations
|
|
// Implementation details...
|
|
}
|
|
|
|
fn metalSoftmax(x: anytype, dim: usize, allocator: std.mem.Allocator) !void {
|
|
// Metal implementation
|
|
// Implementation details...
|
|
}
|
|
|
|
// Other Metal-specific implementations...
|
|
};
|
|
```
|
|
|
|
**Key Features:**
|
|
- Abstract interface with compile-time type safety
|
|
- Proper error handling with Zig's error system
|
|
- Zero-cost abstraction for backend dispatch
|
|
- Dynamic backend selection based on available hardware
|
|
- Specialized implementations for different hardware architectures
|
|
- Thread pool integration for CPU parallelism
|
|
- Resource management for GPU backends
|
|
- Pipeline caching for improved performance
|
|
|
|
|
|
#### 3.3 SIMD Vectorization
|
|
|
|
DeepSeek-V3 leverages Zig's built-in vector types to achieve high-performance computation across different architectures.
|
|
|
|
```zig
|
|
// Define vector types with architecture-specific sizes
|
|
pub fn VectorType(comptime T: type, comptime len: usize) type {
|
|
return @Vector(len, T);
|
|
}
|
|
|
|
// Compile-time determination of optimal vector size
|
|
pub fn getOptimalVectorSize(comptime T: type) usize {
|
|
const target = @import("builtin").target;
|
|
|
|
// Determine vector size based on architecture and data type
|
|
if (T == f32) {
|
|
if (target.cpu.arch == .x86_64 or target.cpu.arch == .x86) {
|
|
if (target.cpu.features.isEnabled(.avx512f)) {
|
|
return 16; // 512 bits / 32 bits = 16 elements
|
|
} else if (target.cpu.features.isEnabled(.avx2)) {
|
|
return 8; // 256 bits / 32 bits = 8 elements
|
|
} else if (target.cpu.features.isEnabled(.sse4_1)) {
|
|
return 4; // 128 bits / 32 bits = 4 elements
|
|
}
|
|
} else if (target.cpu.arch == .aarch64) {
|
|
if (target.cpu.features.isEnabled(.neon)) {
|
|
return 4; // 128 bits / 32 bits = 4 elements
|
|
}
|
|
}
|
|
} else if (T == f16) {
|
|
// Similar logic for f16 with doubled vector sizes
|
|
// ...
|
|
}
|
|
|
|
// Default fallback
|
|
return 4;
|
|
}
|
|
|
|
// Example of SIMD matrix multiplication
|
|
pub fn matrixMultiplySIMD(comptime T: type, a: []const T, b: []const T, c: []T, m: usize, n: usize, k: usize) void {
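// Note: `b` is indexed as b[row + col * k] below, i.e. it is assumed to be stored
// column-major (equivalently, a transposed [n][k] row-major layout) so that the
// inner vectorized loop reads contiguous memory for both operands.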
|
|
const vec_size = comptime getOptimalVectorSize(T);
|
|
const Vec = VectorType(T, vec_size);
|
|
|
|
// Process blocks that align with vector size
|
|
const k_vec = k / vec_size * vec_size;
|
|
|
|
for (0..m) |i| {
|
|
for (0..n) |j| {
|
|
var sum: T = 0;
|
|
var vec_sum: Vec = @splat(0);
|
|
|
|
// Vector part
|
|
var kv: usize = 0;
|
|
while (kv < k_vec) : (kv += vec_size) {
|
|
const a_vec = blk: {
|
|
var tmp: Vec = undefined;
|
|
for (0..vec_size) |v| {
|
|
tmp[v] = a[i * k + kv + v];
|
|
}
|
|
break :blk tmp;
|
|
};
|
|
|
|
const b_vec = blk: {
|
|
var tmp: Vec = undefined;
|
|
for (0..vec_size) |v| {
|
|
tmp[v] = b[kv + v + j * k];
|
|
}
|
|
break :blk tmp;
|
|
};
|
|
|
|
vec_sum += a_vec * b_vec;
|
|
}
|
|
|
|
// Reduce vector
|
|
for (0..vec_size) |v| {
|
|
sum += vec_sum[v];
|
|
}
|
|
|
|
// Remaining elements
|
|
for (k_vec..k) |kk| {
|
|
sum += a[i * k + kk] * b[kk + j * k];
|
|
}
|
|
|
|
c[i * n + j] = sum;
|
|
}
|
|
}
|
|
}
|
|
```
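
The same `@Vector` pattern scales down to simpler kernels. The following self-contained dot-product sketch uses `@reduce` for the horizontal sum instead of the manual reduction loop above; the vector width is hard-coded to 8 purely for illustration (where the targeted Zig release provides it, `std.simd.suggestVectorLength` could replace the hand-rolled `getOptimalVectorSize`):

```zig
const std = @import("std");

// Vectorized dot product: process 8 floats at a time, then a scalar tail.
fn dotSIMD(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);

    const width = 8;
    const Vec = @Vector(width, f32);

    var acc: Vec = @splat(0);
    var i: usize = 0;
    while (i + width <= a.len) : (i += width) {
        const av: Vec = a[i..][0..width].*;
        const bv: Vec = b[i..][0..width].*;
        acc += av * bv;
    }

    // Horizontal reduction of the vector accumulator
    var sum: f32 = @reduce(.Add, acc);

    // Scalar tail for lengths that are not a multiple of the vector width
    while (i < a.len) : (i += 1) {
        sum += a[i] * b[i];
    }
    return sum;
}

test "dotSIMD matches the scalar result" {
    const a = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    const b = [_]f32{ 9, 8, 7, 6, 5, 4, 3, 2, 1 };
    try std.testing.expectApproxEqAbs(@as(f32, 165.0), dotSIMD(&a, &b), 1e-5);
}
```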
|
|
|
|
#### 3.4 Runtime CPU Feature Detection
|
|
|
|
```zig
|
|
pub fn detectCpuFeatures() BackendConfig {
|
|
var config = BackendConfig{
|
|
.backend_type = BackendType.Cpu,
|
|
};
|
|
|
|
// Try to detect CPU features at runtime
|
|
const cpu_info = std.zig.system.getCpuInfo() catch {
|
|
// Fallback to safe defaults if detection fails
|
|
return config;
|
|
};
|
|
|
|
// Configure based on detected features
|
|
config.use_avx512 = cpu_info.features.isEnabled(.avx512f);
|
|
config.use_avx2 = cpu_info.features.isEnabled(.avx2);
|
|
config.use_sse4_1 = cpu_info.features.isEnabled(.sse4_1);
|
|
config.use_neon = cpu_info.features.isEnabled(.neon);
|
|
|
|
return config;
|
|
}
|
|
```
|
|
|
|
#### 3.5 Backend Configuration
|
|
|
|
Backend configuration allows fine-tuning performance characteristics based on hardware capabilities and workload requirements.
|
|
|
|
```zig
|
|
pub const BackendType = enum {
|
|
Cpu,
|
|
Cuda,
|
|
Metal,
|
|
Vulkan,
|
|
WebGPU,
|
|
};
|
|
|
|
pub const BackendConfig = struct {
|
|
backend_type: BackendType,
|
|
max_threads: ?usize = null,
|
|
cache_line_size: usize = 64, // Default x86-64 cache line size
|
|
use_avx512: bool = false, // Use AVX-512 when available
|
|
use_avx2: bool = true, // Use AVX2 when available
|
|
use_sse4_1: bool = true, // Use SSE4.1 when available
|
|
use_neon: bool = false, // Use ARM NEON when available
|
|
prefetch_distance: usize = 8, // Prefetch N cache lines ahead
|
|
tiling_size: ?[2]usize = null, // Matrix tiling dimensions
|
|
batch_size: ?usize = null, // Batch size for kernel operations
|
|
memory_pool_size: ?usize = null, // Size of pre-allocated memory pool
|
|
use_half_precision: bool = false, // Use FP16 where appropriate
|
|
use_mixed_precision: bool = true, // Use mixed precision for matmul
|
|
|
|
// GPU-specific options
|
|
workgroup_size: ?[3]usize = null, // GPU workgroup dimensions
|
|
shared_memory_size: ?usize = null, // GPU shared memory allocation
|
|
compute_queue_depth: usize = 3, // Maximum concurrent compute operations
|
|
};
|
|
```
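
For example, a configuration tuned for an AVX2-capable desktop CPU might look like the following; the values are illustrative starting points, not benchmarked settings:

```zig
const desktop_cpu_config = BackendConfig{
    .backend_type = .Cpu,
    .max_threads = 16,
    .use_avx512 = false,
    .use_avx2 = true,
    .use_sse4_1 = true,
    .tiling_size = .{ 64, 64 },
    .batch_size = 8,
    .use_mixed_precision = true,
};
```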
|
|
|
|
#### 3.6 GPU Integration
|
|
|
|
DeepSeek-V3 supports multiple GPU backends, with specialized implementations for each platform.
|
|
|
|
#### 3.6.1 CUDA Backend
|
|
|
|
```zig
|
|
pub const CudaBackend = struct {
|
|
allocator: std.mem.Allocator,
|
|
device: i32,
|
|
stream: ?*anyopaque,
|
|
handles: CudaHandles,
|
|
module_cache: ModuleCache,
|
|
|
|
pub fn init(allocator: std.mem.Allocator, device_id: ?i32) !ComputeBackend {
|
|
// Initialize CUDA device, context, and stream
|
|
const device = if (device_id) |id| id else try getOptimalCudaDevice();
|
|
try cudaSetDevice(device);
|
|
|
|
var stream: ?*anyopaque = null;
|
|
try checkCudaStatus(cudaStreamCreate(&stream));
|
|
|
|
// Initialize cuBLAS and cuDNN handles
|
|
var handles = try CudaHandles.init(stream);
|
|
|
|
// Compile and cache essential CUDA kernels
|
|
var module_cache = try ModuleCache.init(allocator);
|
|
try module_cache.compileEssentialKernels();
|
|
|
|
return ComputeBackend{
|
|
.matmulFn = cudaMatmul,
|
|
.softmaxFn = cudaSoftmax,
|
|
.rmsnormFn = cudaRmsnorm,
|
|
.attentionFn = cudaAttention,
|
|
// Other operations...
|
|
.config = BackendConfig{
|
|
.backend_type = .Cuda,
|
|
.workgroup_size = .{16, 16, 1},
|
|
.shared_memory_size = 48 * 1024,
|
|
// Other CUDA-specific config...
|
|
},
|
|
};
|
|
}
|
|
|
|
fn cudaMatmul(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void {
|
|
// Use cuBLAS for large matrices
|
|
// Fall back to custom kernels for specialized operations
|
|
// Implementation details...
|
|
}
|
|
|
|
// Other CUDA-specific implementations...
|
|
};
|
|
```
|
|
|
|
#### 3.6.2 Vulkan Backend
|
|
|
|
```zig
|
|
pub const VulkanBackend = struct {
|
|
allocator: std.mem.Allocator,
|
|
instance: vk.Instance,
|
|
physical_device: vk.PhysicalDevice,
|
|
device: vk.Device,
|
|
compute_queue: vk.Queue,
|
|
command_pool: vk.CommandPool,
|
|
pipeline_cache: vk.PipelineCache,
|
|
shader_modules: ShaderModuleCache,
|
|
|
|
pub fn init(allocator: std.mem.Allocator) !ComputeBackend {
|
|
// Initialize Vulkan instance, device, and queues
|
|
// Implementation details...
|
|
|
|
return ComputeBackend{
|
|
.matmulFn = vulkanMatmul,
|
|
.softmaxFn = vulkanSoftmax,
|
|
.rmsnormFn = vulkanRmsnorm,
|
|
.attentionFn = vulkanAttention,
|
|
// Other operations...
|
|
.config = BackendConfig{
|
|
.backend_type = .Vulkan,
|
|
// Vulkan-specific config...
|
|
},
|
|
};
|
|
}
|
|
|
|
// Vulkan-specific implementations...
|
|
};
|
|
```
|
|
|
|
#### 3.7 Quantization Framework
|
|
|
|
The quantization framework enables efficient model deployment through reduced precision arithmetic.
|
|
|
|
```zig
|
|
// Supported quantization methods
|
|
pub const QuantizationMethod = enum {
|
|
None,
|
|
FP16, // Half precision
|
|
Int8, // 8-bit integer quantization
|
|
Int4, // 4-bit integer quantization
|
|
NF4, // NormalFloat4 quantization
|
|
GPTQ, // GPTQ quantization
|
|
AWQ, // Activation-aware weight quantization
|
|
};
|
|
|
|
// Quantization configuration
|
|
pub const QuantConfig = struct {
|
|
method: QuantizationMethod = .None,
|
|
scale_type: ?type = null, // Type for quantization scales
|
|
group_size: usize = 128, // Size of quantization groups
|
|
bits: u8 = 8, // Bits per quantized value
|
|
symmetric: bool = false, // Symmetric vs asymmetric quantization
|
|
|
|
// Calibration parameters
|
|
calibration_dataset: ?[]const u8 = null,
|
|
num_calibration_samples: usize = 128,
|
|
|
|
// Sparsity options
|
|
use_sparse: bool = false,
|
|
sparsity_threshold: f32 = 0.01,
|
|
};
|
|
|
|
// Abstract quantizer interface
|
|
pub const Quantizer = struct {
|
|
const Self = @This();
|
|
|
|
quantizeFn: *const fn(self: *Self, tensor: Tensor, config: QuantConfig, allocator: std.mem.Allocator) anyerror!Tensor,
|
|
dequantizeFn: *const fn(self: *Self, tensor: Tensor, allocator: std.mem.Allocator) anyerror!Tensor,
|
|
|
|
pub fn quantize(self: *Self, tensor: Tensor, config: QuantConfig, allocator: std.mem.Allocator) !Tensor {
|
|
return self.quantizeFn(self, tensor, config, allocator);
|
|
}
|
|
|
|
pub fn dequantize(self: *Self, tensor: Tensor, allocator: std.mem.Allocator) !Tensor {
|
|
return self.dequantizeFn(self, tensor, allocator);
|
|
}
|
|
};
|
|
```
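
To ground the configuration above, here is a minimal sketch of the arithmetic behind symmetric per-group `Int8` quantization (scale = max |x| / 127, q = round(x / scale), x ≈ q · scale). It illustrates the math only and is not a full `Quantizer` implementation; builtin cast names follow recent Zig releases:

```zig
const std = @import("std");

// Quantize one group of weights to int8 with a shared symmetric scale.
// Returns the scale, which must be stored alongside the quantized values.
fn quantizeGroupInt8(values: []const f32, out: []i8) f32 {
    std.debug.assert(values.len == out.len);

    // Symmetric scale: map the largest magnitude onto the int8 range [-127, 127].
    var max_abs: f32 = 0.0;
    for (values) |v| {
        const mag = if (v < 0) -v else v;
        if (mag > max_abs) max_abs = mag;
    }
    const scale: f32 = if (max_abs > 0) max_abs / 127.0 else 1.0;

    for (values, out) |v, *q| {
        const scaled = std.math.clamp(@round(v / scale), -127.0, 127.0);
        q.* = @intFromFloat(scaled);
    }

    // Dequantization is simply: x ≈ @as(f32, @floatFromInt(q)) * scale
    return scale;
}
```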
|
|
|
|
#### 3.8 Memory Management
|
|
|
|
Efficient memory management is crucial for large language model inference.
|
|
|
|
```zig
|
|
// Memory allocation strategy
|
|
pub const AllocStrategy = enum {
|
|
Default, // Standard allocator
|
|
Arena, // Arena allocator for bulk allocations
|
|
Pool, // Memory pool for fixed-size allocations
|
|
Streaming, // Streaming allocator for pipelined operations
|
|
Pinned, // Pinned memory for efficient host-device transfers
|
|
};
|
|
|
|
// Memory pool for efficient tensor allocations
|
|
pub const TensorMemoryPool = struct {
|
|
const Self = @This();
|
|
|
|
parent_allocator: std.mem.Allocator,
|
|
pool: std.heap.MemoryPool,
|
|
block_sizes: []const usize,
|
|
blocks: std.AutoArrayHashMap(usize, std.ArrayList(*anyopaque)),
|
|
mutex: std.Thread.Mutex,
|
|
stats: MemoryStats,
|
|
|
|
pub fn init(allocator: std.mem.Allocator, config: MemoryPoolConfig) !Self {
|
|
// Initialize memory pool with predefined block sizes
|
|
// Implementation details...
|
|
}
|
|
|
|
pub fn allocate(self: *Self, size: usize, alignment: usize) ![]u8 {
|
|
// Find the appropriate block size or allocate directly
|
|
// Implementation details...
|
|
}
|
|
|
|
pub fn free(self: *Self, ptr: []u8) void {
|
|
// Return to pool or free directly
|
|
// Implementation details...
|
|
}
|
|
|
|
// Memory management utilities
|
|
pub fn preallocate(self: *Self, block_size: usize, count: usize) !void {
|
|
// Preallocate multiple blocks of the specified size
|
|
// Implementation details...
|
|
}
|
|
|
|
pub fn reclaim(self: *Self) void {
|
|
// Reclaim unused memory blocks
|
|
// Implementation details...
|
|
}
|
|
};
|
|
|
|
// Key-Value cache management for efficient inference
|
|
pub const KVCache = struct {
|
|
allocator: std.mem.Allocator,
|
|
k_cache: Tensor,
|
|
v_cache: Tensor,
|
|
capacity: usize,
|
|
size: usize,
|
|
head_dim: usize,
|
|
num_heads: usize,
|
|
|
|
pub fn init(allocator: std.mem.Allocator, batch_size: usize, num_heads: usize, head_dim: usize, max_seq_len: usize) !Self {
|
|
// Initialize key-value cache with appropriate dimensions
|
|
// Implementation details...
|
|
}
|
|
|
|
pub fn append(self: *Self, k: Tensor, v: Tensor, pos: usize) !void {
|
|
// Append new key-value pairs to the cache
|
|
// Implementation details...
|
|
}
|
|
|
|
pub fn prefill(self: *Self, k: Tensor, v: Tensor) !void {
|
|
// Prefill the cache with initial key-value pairs
|
|
// Implementation details...
|
|
}
|
|
|
|
pub fn rotatePositions(self: *Self, positions: []const usize) !void {
|
|
// Rearrange cache entries based on position IDs (for speculative decoding)
|
|
// Implementation details...
|
|
}
|
|
|
|
pub fn clear(self: *Self) void {
|
|
// Reset the cache size without deallocating memory
|
|
// Implementation details...
|
|
}
|
|
};
|
|
```
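
The KV cache is usually the dominant memory consumer at long context lengths, so it is worth sizing it up front. The sketch below is a back-of-the-envelope estimate for a plain per-head cache; the numbers in the test are illustrative placeholders rather than DeepSeek V3's actual configuration, and MLA-style latent compression would shrink the real figure considerably:

```zig
const std = @import("std");

// Rough size of a dense key/value cache: 2 (K and V) x layers x batch x
// sequence length x heads x head dimension x bytes per element.
fn kvCacheBytes(
    num_layers: usize,
    batch_size: usize,
    max_seq_len: usize,
    num_heads: usize,
    head_dim: usize,
    bytes_per_element: usize,
) usize {
    return 2 * num_layers * batch_size * max_seq_len * num_heads * head_dim * bytes_per_element;
}

test "kv cache sizing example" {
    // 32 layers, batch 1, 4096 tokens, 32 heads of dim 128, fp16 elements
    // => 2 * 32 * 1 * 4096 * 32 * 128 * 2 bytes = 2 GiB
    const bytes = kvCacheBytes(32, 1, 4096, 32, 128, 2);
    try std.testing.expectEqual(@as(usize, 2 * 1024 * 1024 * 1024), bytes);
}
```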
|
|
|
|
#### 3.9 Metal Integration for Apple Silicon
|
|
|
|
Modern Apple Silicon devices offer exceptional compute performance, and our Zig implementation takes full advantage of these capabilities through direct Metal API integration:
|
|
|
|
```zig
|
|
pub const MetalBackend = struct {
|
|
const Self = @This();
|
|
|
|
// Core Metal resources
|
|
device: *MTLDevice,
|
|
command_queue: *MTLCommandQueue,
|
|
library: *MTLLibrary,
|
|
|
|
// Pipeline cache for reusing compiled compute pipelines
|
|
pipeline_cache: std.AutoHashMap(u64, *MTLComputePipelineState),
|
|
|
|
// Memory management
|
|
allocator: std.mem.Allocator,
|
|
buffer_pool: BufferPool,
|
|
|
|
// Configuration and statistics
|
|
config: BackendConfig,
|
|
stats: MetalStatistics,
|
|
|
|
pub fn init(allocator: std.mem.Allocator) !*Self {
|
|
// Get the default Metal device
|
|
var device = MTLCreateSystemDefaultDevice();
|
|
if (device == null) return error.MetalDeviceNotAvailable;
|
|
|
|
// Create a command queue for submitting work to the GPU
|
|
var command_queue = device.?.newCommandQueue();
|
|
if (command_queue == null) return error.MetalCommandQueueCreationFailed;
|
|
|
|
// Compile our Metal shader library from source or load precompiled metallib
|
|
var library: ?*MTLLibrary = null;
|
|
if (comptime @import("builtin").mode == .Debug) {
|
|
// Compile from source for easier debugging
|
|
library = try compileLibraryFromSource(device.?, shader_source);
|
|
} else {
|
|
// Use precompiled metallib for release builds
|
|
const metallib_path = try findMetalLibPath(allocator);
|
|
defer allocator.free(metallib_path);
|
|
|
|
library = try loadCompiledLibrary(device.?, metallib_path);
|
|
}
|
|
|
|
// Create the Metal backend
|
|
var self = try allocator.create(Self);
|
|
errdefer allocator.destroy(self);
|
|
|
|
// Initialize the pipeline cache
|
|
var pipeline_cache = std.AutoHashMap(u64, *MTLComputePipelineState).init(allocator);
|
|
errdefer pipeline_cache.deinit();
|
|
|
|
// Initialize the buffer pool for efficient memory reuse
|
|
var buffer_pool = try BufferPool.init(allocator, device.?);
|
|
errdefer buffer_pool.deinit();
|
|
|
|
// Get optimal configuration based on the device capabilities
|
|
var config = try getMetalOptimalConfig(device.?);
|
|
|
|
self.* = .{
|
|
.device = device.?,
|
|
.command_queue = command_queue.?,
|
|
.library = library.?,
|
|
.pipeline_cache = pipeline_cache,
|
|
.allocator = allocator,
|
|
.buffer_pool = buffer_pool,
|
|
.config = config,
|
|
.stats = MetalStatistics.init(),
|
|
};
|
|
|
|
return self;
|
|
}
|
|
|
|
pub fn deinit(self: *Self) void {
|
|
// Release all cached pipelines
|
|
var it = self.pipeline_cache.valueIterator();
|
|
while (it.next()) |pipeline| {
|
|
pipeline.*.release();
|
|
}
|
|
self.pipeline_cache.deinit();
|
|
|
|
// Clean up buffer pool
|
|
self.buffer_pool.deinit();
|
|
|
|
// Release Metal resources
|
|
self.library.release();
|
|
self.command_queue.release();
|
|
self.device.release();
|
|
|
|
// Free memory
|
|
self.allocator.destroy(self);
|
|
}
|
|
|
|
// Get or create a compute pipeline for a function
|
|
pub fn getPipeline(self: *Self, function_name: []const u8) !*MTLComputePipelineState {
|
|
// Hash the function name for quick lookup
|
|
const hash = std.hash.CityHash64.hash(function_name);
|
|
|
|
// Check if we already have a cached pipeline
|
|
if (self.pipeline_cache.get(hash)) |pipeline| {
|
|
return pipeline;
|
|
}
|
|
|
|
// Create a new pipeline if not found
|
|
var function = self.library.newFunctionWithName(function_name);
|
|
if (function == null) return error.MetalFunctionNotFound;
|
|
defer function.?.release();
|
|
|
|
// Create the compute pipeline
|
|
var pipeline_desc = MTLComputePipelineDescriptor.alloc().init();
|
|
defer pipeline_desc.release();
|
|
|
|
pipeline_desc.setComputeFunction(function.?);
|
|
|
|
// Enable buffer mutability tracking in debug mode
|
|
if (comptime @import("builtin").mode == .Debug) {
|
|
pipeline_desc.setMutabilityOptions(.{
|
|
.MTLPipelineBufferMutabilityAccessTracking = true,
|
|
});
|
|
}
|
|
|
|
// Enable threadgroup memory length optimization
|
|
pipeline_desc.setThreadGroupSizeIsMultipleOfThreadExecutionWidth(true);
|
|
|
|
// Create the pipeline state
|
|
var error_ptr: ?*NSError = null;
|
|
var pipeline = self.device.newComputePipelineStateWithDescriptor(
|
|
pipeline_desc,
|
|
.MTLPipelineOptionArgumentInfo,
|
|
null,
|
|
&error_ptr
|
|
);
|
|
|
|
if (pipeline == null) {
|
|
if (error_ptr != null) {
|
|
// Log the error details
|
|
const error_str = error_ptr.?.localizedDescription().UTF8String();
|
|
std.log.err("Failed to create pipeline for {s}: {s}", .{
|
|
function_name, error_str,
|
|
});
|
|
error_ptr.?.release();
|
|
}
|
|
return error.MetalPipelineCreationFailed;
|
|
}
|
|
|
|
// Cache the pipeline for future use
|
|
try self.pipeline_cache.put(hash, pipeline.?);
|
|
|
|
return pipeline.?;
|
|
}
|
|
|
|
// Execute a compute kernel with the given parameters
|
|
pub fn executeKernel(
|
|
self: *Self,
|
|
kernel_name: []const u8,
|
|
grid_size: [3]u32,
|
|
block_size: [3]u32,
|
|
buffers: []const MetalBuffer,
|
|
wait_until_completed: bool,
|
|
) !void {
|
|
// Get the pipeline for this kernel
|
|
var pipeline = try self.getPipeline(kernel_name);
|
|
|
|
// Create a command buffer
|
|
var command_buffer = self.command_queue.commandBuffer();
|
|
if (command_buffer == null) return error.MetalCommandBufferCreationFailed;
|
|
|
|
// Create a compute command encoder
|
|
var encoder = command_buffer.?.computeCommandEncoder();
|
|
if (encoder == null) return error.MetalComputeEncoderCreationFailed;
|
|
|
|
// Set the compute pipeline
|
|
encoder.?.setComputePipelineState(pipeline);
|
|
|
|
// Bind buffers
|
|
for (buffers, 0..) |buffer, i| {
|
|
encoder.?.setBuffer(buffer.handle, buffer.offset, @intCast(i));
|
|
}
|
|
|
|
// Calculate threadgroup size
|
|
var threadgroup_size = MTLSize{
|
|
.width = block_size[0],
|
|
.height = block_size[1],
|
|
.depth = block_size[2],
|
|
};
|
|
|
|
// Calculate grid size
|
|
var grid = MTLSize{
|
|
.width = grid_size[0],
|
|
.height = grid_size[1],
|
|
.depth = grid_size[2],
|
|
};
|
|
|
|
// Dispatch the compute work
|
|
encoder.?.dispatchThreadgroups(grid, threadgroup_size);
|
|
|
|
// End encoding
|
|
encoder.?.endEncoding();
|
|
|
|
// Commit the command buffer
|
|
command_buffer.?.commit();
|
|
|
|
// Wait for completion if requested
|
|
if (wait_until_completed) {
|
|
command_buffer.?.waitUntilCompleted();
|
|
}
|
|
|
|
// Update statistics
|
|
self.stats.kernel_executions += 1;
|
|
}
|
|
|
|
// Create a buffer and copy data to it
|
|
pub fn createBuffer(
|
|
self: *Self,
|
|
data: []const u8,
|
|
options: MTLResourceOptions,
|
|
) !*MTLBuffer {
|
|
// Get a buffer from the pool or create a new one
|
|
var buffer = try self.buffer_pool.getBuffer(data.len, options);
|
|
|
|
// Copy data to the buffer
|
|
@memcpy(buffer.contents()[0..data.len], data);
|
|
|
|
return buffer;
|
|
}
|
|
|
|
// Create a tensor in Metal memory
|
|
pub fn createTensor(self: *Self, tensor: Tensor(f32, 2)) !MetalTensor {
|
|
// Calculate size in bytes
|
|
const size_bytes = tensor.data.len * @sizeOf(f32);
|
|
|
|
// Create a buffer
|
|
var buffer = try self.createBuffer(
|
|
@ptrCast([*]const u8, tensor.data.ptr)[0..size_bytes],
|
|
.StorageModeShared
|
|
);
|
|
|
|
return MetalTensor{
|
|
.buffer = buffer,
|
|
.shape = tensor.shape,
|
|
.element_type = .f32,
|
|
};
|
|
}
|
|
|
|
// Example implementation of matrix multiplication using Metal
|
|
pub fn matmul(
|
|
self: *Self,
|
|
a: Tensor(f32, 2),
|
|
b: Tensor(f32, 2),
|
|
) !Tensor(f32, 2) {
|
|
// Validate dimensions
|
|
std.debug.assert(a.shape[1] == b.shape[0]); // Incompatible matrix dimensions
|
|
|
|
const m = a.shape[0];
|
|
const k = a.shape[1];
|
|
const n = b.shape[1];
|
|
|
|
// Create result tensor
|
|
var result = try Tensor(f32, 2).init(self.allocator, .{m, n});
|
|
errdefer result.deinit();
|
|
|
|
// Create Metal tensors
|
|
var a_metal = try self.createTensor(a);
|
|
defer a_metal.buffer.release();
|
|
|
|
var b_metal = try self.createTensor(b);
|
|
defer b_metal.buffer.release();
|
|
|
|
var result_metal = try self.createTensor(result);
|
|
defer result_metal.buffer.release();
|
|
|
|
// Create dimension buffer
|
|
const dims = [_]u32{@intCast(m), @intCast(k), @intCast(n)};
|
|
var dims_buffer = try self.createBuffer(
|
|
@ptrCast([*]const u8, &dims)[0..dims.len * @sizeOf(u32)],
|
|
.StorageModeShared
|
|
);
|
|
defer dims_buffer.release();
|
|
|
|
// Set up buffers
|
|
const buffers = [_]MetalBuffer{
|
|
.{ .handle = a_metal.buffer, .offset = 0 },
|
|
.{ .handle = b_metal.buffer, .offset = 0 },
|
|
.{ .handle = result_metal.buffer, .offset = 0 },
|
|
.{ .handle = dims_buffer, .offset = 0 },
|
|
};
|
|
|
|
// Calculate optimal workgroup size
|
|
const workgroup_size: [3]u32 = if (self.config.workgroup_size) |ws|
|
|
.{ @intCast(ws[0]), @intCast(ws[1]), 1 }
|
|
else
|
|
.{ 16, 16, 1 };
|
|
|
|
// Calculate grid size
|
|
const grid_size: [3]u32 = .{
|
|
(n + workgroup_size[0] - 1) / workgroup_size[0],
|
|
(m + workgroup_size[1] - 1) / workgroup_size[1],
|
|
1,
|
|
};
|
|
|
|
// Execute the kernel
|
|
try self.executeKernel(
|
|
"matmul",
|
|
grid_size,
|
|
workgroup_size,
|
|
&buffers,
|
|
true
|
|
);
|
|
|
|
// Copy data back from Metal
|
|
@memcpy(
|
|
result.data,
|
|
@ptrCast([*]const f32, result_metal.buffer.contents())[0..result.data.len]
|
|
);
|
|
|
|
return result;
|
|
}
|
|
};
|
|
|
|
// Efficient buffer pooling to avoid frequent allocations
|
|
pub const BufferPool = struct {
|
|
const Self = @This();
|
|
|
|
allocator: std.mem.Allocator,
|
|
device: *MTLDevice,
|
|
free_buffers: std.AutoHashMap(u64, std.ArrayList(*MTLBuffer)),
|
|
|
|
pub fn init(allocator: std.mem.Allocator, device: *MTLDevice) !Self {
|
|
return Self{
|
|
.allocator = allocator,
|
|
.device = device,
|
|
.free_buffers = std.AutoHashMap(u64, std.ArrayList(*MTLBuffer)).init(allocator),
|
|
};
|
|
}
|
|
|
|
pub fn deinit(self: *Self) void {
|
|
// Release all buffers
|
|
var it = self.free_buffers.valueIterator();
|
|
while (it.next()) |buffer_list| {
|
|
for (buffer_list.items) |buffer| {
|
|
buffer.release();
|
|
}
|
|
buffer_list.deinit();
|
|
}
|
|
self.free_buffers.deinit();
|
|
}
|
|
|
|
// Get a buffer of at least the requested size
|
|
pub fn getBuffer(self: *Self, size: usize, options: MTLResourceOptions) !*MTLBuffer {
|
|
// Round up to power of 2 for better reuse
|
|
const aligned_size = nextPowerOfTwo(size);
|
|
|
|
// Check if we have a free buffer of appropriate size
|
|
if (self.free_buffers.getPtr(aligned_size)) |buffer_list| {
|
|
if (buffer_list.items.len > 0) {
|
|
// Reuse an existing buffer
|
|
return buffer_list.pop();
|
|
}
|
|
}
|
|
|
|
// Create a new buffer if none available
|
|
var buffer = self.device.newBufferWithLength(aligned_size, options);
|
|
if (buffer == null) return error.MetalBufferAllocationFailed;
|
|
|
|
return buffer.?;
|
|
}
|
|
|
|
// Return a buffer to the pool for reuse
|
|
pub fn releaseBuffer(self: *Self, buffer: *MTLBuffer) !void {
|
|
const size = buffer.length();
|
|
const aligned_size = nextPowerOfTwo(size);
|
|
|
|
// Add to the appropriate size list
|
|
if (self.free_buffers.getPtr(aligned_size)) |buffer_list| {
|
|
try buffer_list.append(buffer);
|
|
} else {
|
|
// Create a new list if this is the first buffer of this size
|
|
var buffer_list = std.ArrayList(*MTLBuffer).init(self.allocator);
|
|
try buffer_list.append(buffer);
|
|
try self.free_buffers.put(aligned_size, buffer_list);
|
|
}
|
|
}
|
|
|
|
// Utility to find next power of two
|
|
fn nextPowerOfTwo(n: usize) usize {
|
|
var v = n;
|
|
v -= 1;
|
|
v |= v >> 1;
|
|
v |= v >> 2;
|
|
v |= v >> 4;
|
|
v |= v >> 8;
|
|
v |= v >> 16;
|
|
if (comptime @bitSizeOf(usize) > 32) v |= v >> 32;
|
|
v += 1;
|
|
return v;
|
|
}
|
|
};
|
|
|
|
// Representation of a tensor in Metal memory
|
|
pub const MetalTensor = struct {
|
|
buffer: *MTLBuffer,
|
|
shape: []const usize,
|
|
element_type: enum {
|
|
f16,
|
|
f32,
|
|
},
|
|
};
|
|
|
|
// Helper for buffer binding
|
|
pub const MetalBuffer = struct {
|
|
handle: *MTLBuffer,
|
|
offset: u64 = 0,
|
|
};
|
|
|
|
// Statistics for performance monitoring
|
|
pub const MetalStatistics = struct {
|
|
kernel_executions: usize = 0,
|
|
bytes_transferred: usize = 0,
|
|
peak_memory_usage: usize = 0,
|
|
|
|
pub fn init() MetalStatistics {
|
|
return .{};
|
|
}
|
|
};
|
|
|
|
// Example Metal shader source for matrix multiplication
|
|
const shader_source =
|
|
\\#include <metal_stdlib>
|
|
\\using namespace metal;
|
|
\\
|
|
\\kernel void matmul(
|
|
\\ const device float* a [[buffer(0)]],
|
|
\\ const device float* b [[buffer(1)]],
|
|
\\ device float* result [[buffer(2)]],
|
|
\\ const device uint* dims [[buffer(3)]],
|
|
\\ uint2 gid [[thread_position_in_grid]],
|
|
\\ uint2 lid [[thread_position_in_threadgroup]],
|
|
\\ uint2 lsize [[threads_per_threadgroup]])
|
|
\\{
|
|
\\ const uint m = dims[0];
|
|
\\ const uint k = dims[1];
|
|
\\ const uint n = dims[2];
|
|
\\
|
|
\\ // Check if within bounds
|
|
\\ if (gid.x >= n || gid.y >= m) return;
|
|
\\
|
|
\\ // Calculate result[gid.y][gid.x]
|
|
\\ float sum = 0.0f;
|
|
\\ for (uint i = 0; i < k; i++) {
|
|
\\ sum += a[gid.y * k + i] * b[i * n + gid.x];
|
|
\\ }
|
|
\\
|
|
\\ result[gid.y * n + gid.x] = sum;
|
|
\\}
|
|
\\
|
|
\\kernel void matmul_optimized(
|
|
\\ const device float* a [[buffer(0)]],
|
|
\\ const device float* b [[buffer(1)]],
|
|
\\ device float* result [[buffer(2)]],
|
|
\\ const device uint* dims [[buffer(3)]],
|
|
\\ uint2 gid [[thread_position_in_grid]],
|
|
\\ uint2 lid [[thread_position_in_threadgroup]],
|
|
\\ uint2 lsize [[threads_per_threadgroup]])
|
|
\\{
|
|
\\ const uint m = dims[0];
|
|
\\ const uint k = dims[1];
|
|
\\ const uint n = dims[2];
|
|
\\
|
|
\\ // Check if within bounds
|
|
\\ if (gid.x >= n || gid.y >= m) return;
|
|
\\
|
|
\\ // Use threadgroup memory for caching
|
|
\\ threadgroup float a_cache[16][16];
|
|
\\ threadgroup float b_cache[16][16];
|
|
\\
|
|
\\ float sum = 0.0f;
|
|
\\
|
|
\\ // Process in tiles
|
|
\\ for (uint tile = 0; tile < (k + 15) / 16; tile++) {
|
|
\\ // Load a tile into threadgroup memory
|
|
\\ const uint tile_idx = tile * 16;
|
|
\\
|
|
\\ if (tile_idx + lid.x < k && gid.y < m) {
|
|
\\ a_cache[lid.y][lid.x] = a[gid.y * k + tile_idx + lid.x];
|
|
\\ } else {
|
|
\\ a_cache[lid.y][lid.x] = 0.0f;
|
|
\\ }
|
|
\\
|
|
\\ if (tile_idx + lid.y < k && gid.x < n) {
|
|
\\ b_cache[lid.y][lid.x] = b[(tile_idx + lid.y) * n + gid.x];
|
|
\\ } else {
|
|
\\ b_cache[lid.y][lid.x] = 0.0f;
|
|
\\ }
|
|
\\
|
|
\\ // Wait for all threads to load data
|
|
\\ threadgroup_barrier(mem_flags::mem_threadgroup);
|
|
\\
|
|
\\ // Compute partial dot product for this tile
|
|
\\ for (uint i = 0; i < 16; i++) {
|
|
\\ sum += a_cache[lid.y][i] * b_cache[i][lid.x];
|
|
\\ }
|
|
\\
|
|
\\ // Wait for all threads to finish using the cached data
|
|
\\ threadgroup_barrier(mem_flags::mem_threadgroup);
|
|
\\ }
|
|
\\
|
|
\\ // Write result
|
|
\\ if (gid.x < n && gid.y < m) {
|
|
\\ result[gid.y * n + gid.x] = sum;
|
|
\\ }
|
|
\\}
|
|
;
|
|
```
|
|
|
|
**Apple-Specific Optimizations:**
|
|
|
|
1. **Metal Shader Integration**
|
|
- Direct compilation of Metal shaders from Zig source code
|
|
- Runtime shader compilation in debug mode for easier iteration
|
|
- Precompiled metallib loading for optimized release builds
|
|
|
|
2. **Memory Management**
|
|
- Buffer pooling to minimize allocations and deallocations
|
|
- Shared memory mode for zero-copy between CPU and GPU
|
|
- Explicit control over resource storage options
|
|
|
|
3. **Performance Optimizations**
|
|
- Tile-based computation for optimal cache utilization
|
|
- Threadgroup memory usage for shared data access
|
|
- Work distribution based on detected GPU characteristics
|
|
- Pipeline state caching for faster kernel dispatching
|
|
|
|
4. **AMX Acceleration**
|
|
- Support for Apple Matrix extensions (AMX)
|
|
- Specialized matrix multiplication operations for M-series chips
|
|
- Custom shader variants optimized for different Apple Silicon generations
|
|
|
|
5. **Neural Engine Integration**
|
|
- Optional ANE (Apple Neural Engine) offloading for supported operations
|
|
- Hybrid execution strategies combining GPU and Neural Engine
|
|
- Automatic fallback to Metal for unsupported operations
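
A hypothetical call site for this Metal path, written against the types sketched in this section (the tensor shapes are arbitrary examples), might look like this:

```zig
const std = @import("std");

pub fn metalMatmulExample(allocator: std.mem.Allocator) !void {
    // Sets up the device, shader library, pipeline cache and buffer pool as shown above.
    var backend = try MetalBackend.init(allocator);
    defer backend.deinit();

    var a = try Tensor(f32, 2).init(allocator, .{ 128, 256 });
    defer a.deinit();
    var b = try Tensor(f32, 2).init(allocator, .{ 256, 64 });
    defer b.deinit();

    // ... fill a and b with weights / activations ...

    // Runs the "matmul" kernel from the shader source above and copies the
    // result back into host memory.
    var c = try backend.matmul(a, b);
    defer c.deinit();
}
```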
|
|
|
|
|
|
### 4. Inference Pipeline
|
|
|
|
The inference pipeline is the core execution flow for running the DeepSeek V3 model. Our Zig implementation focuses on efficiency, flexibility, and streaming capabilities.
|
|
|
|
#### 4.1 Model Loading
|
|
|
|
```zig
|
|
// The ModelLoader handles loading and initializing DeepSeek V3 models
|
|
pub const ModelLoader = struct {
|
|
const Self = @This();
|
|
|
|
allocator: std.mem.Allocator,
|
|
config: LoaderConfig,
|
|
|
|
// Configuration for model loading
|
|
pub const LoaderConfig = struct {
|
|
// Number of threads to use for weight loading
|
|
loading_threads: ?usize = null,
|
|
|
|
// Optional cache directory for model weights
|
|
cache_dir: ?[]const u8 = null,
|
|
|
|
// How to handle safetensors format
|
|
safetensors_memory_map: bool = true,
|
|
|
|
// Validation level for loaded weights
|
|
validation: enum {
|
|
none,
|
|
basic,
|
|
full
|
|
} = .basic,
|
|
|
|
// Device to place model on after loading
|
|
target_device: BackendType = .Cpu,
|
|
};
|
|
|
|
pub fn init(allocator: std.mem.Allocator, config: LoaderConfig) Self {
|
|
return .{
|
|
.allocator = allocator,
|
|
.config = config,
|
|
};
|
|
}
|
|
|
|
// Load a model from file
|
|
pub fn loadModel(
|
|
self: *Self,
|
|
path: []const u8,
|
|
model_args: ?ModelArgs,
|
|
) !*TransformerModel {
|
|
const extension = std.fs.path.extension(path);
|
|
|
|
// Determine model format from file extension
|
|
if (std.mem.eql(u8, extension, ".safetensors")) {
|
|
return try self.loadFromSafetensors(path, model_args);
|
|
} else if (std.mem.eql(u8, extension, ".ckpt")) {
|
|
return try self.loadFromCheckpoint(path, model_args);
|
|
} else if (std.mem.eql(u8, extension, ".bin")) {
|
|
return try self.loadFromBinary(path, model_args);
|
|
} else if (std.fs.cwd().access(path, .{})) |_| {} else |_| {
|
|
// Could be a Hugging Face model ID, try to download it
|
|
return try self.loadFromHuggingFace(path, model_args);
|
|
}
|
|
|
|
return error.UnsupportedModelFormat;
|
|
}
|
|
|
|
// Load model from SafeTensors format (optimized for memory mapping)
|
|
fn loadFromSafetensors(
|
|
self: *Self,
|
|
path: []const u8,
|
|
model_args: ?ModelArgs,
|
|
) !*TransformerModel {
|
|
// Open the safetensors file
|
|
var file = try std.fs.cwd().openFile(path, .{});
|
|
defer file.close();
|
|
|
|
// Memory map the file for zero-copy access if configured
|
|
if (self.config.safetensors_memory_map) {
|
|
const file_size = try file.getEndPos();
|
|
|
|
// Memory map the file
|
|
const mapped_memory = try std.os.mmap(
|
|
null,
|
|
file_size,
|
|
std.os.PROT.READ,
|
|
std.os.MAP.PRIVATE,
|
|
file.handle,
|
|
0,
|
|
);
|
|
|
|
// Process the memory-mapped safetensors
|
|
return try self.processSafetensorsMemoryMapped(
|
|
mapped_memory,
|
|
file_size,
|
|
model_args,
|
|
);
|
|
} else {
|
|
// If memory mapping is disabled, read the file conventionally
|
|
return try self.processSafetensorsFile(file, model_args);
|
|
}
|
|
}
|
|
|
|
// Process a memory-mapped SafeTensors file
|
|
fn processSafetensorsMemoryMapped(
|
|
self: *Self,
|
|
memory: []const u8,
|
|
file_size: usize,
|
|
model_args: ?ModelArgs,
|
|
) !*TransformerModel {
|
|
// Parse the header which contains tensor metadata
|
|
const header_size = std.mem.readIntLittle(u64, memory[0..8]);
|
|
const header_json = memory[8..8+header_size];
|
|
|
|
// Parse the JSON header
|
|
var parsed = try std.json.parseFromSlice(
|
|
std.json.Value,
|
|
self.allocator,
|
|
header_json,
|
|
.{},
|
|
);
|
|
defer parsed.deinit();
|
|
|
|
// Get the model configuration from arguments or try to infer it
|
|
const args = try self.determineModelArgs(model_args, parsed.value);
|
|
|
|
// Create the model with the determined configuration
|
|
var model = try TransformerModel.create(self.allocator, args);
|
|
errdefer model.destroy();
|
|
|
|
// Create a tensor mapping for zero-copy loading
|
|
try self.loadTensorsFromSafetensorsMemory(
|
|
model,
|
|
memory,
|
|
header_size,
|
|
parsed.value,
|
|
);
|
|
|
|
// Validate the loaded model if configured
|
|
if (self.config.validation != .none) {
|
|
try self.validateModel(model, parsed.value);
|
|
}
|
|
|
|
return model;
|
|
}
|
|
|
|
// Load a model from Hugging Face
|
|
fn loadFromHuggingFace(
|
|
self: *Self,
|
|
model_id: []const u8,
|
|
model_args: ?ModelArgs,
|
|
) !*TransformerModel {
|
|
// Get cache directory or create a temporary one
|
|
const cache_dir = self.config.cache_dir orelse
|
|
try std.fs.getAppDataDir(self.allocator, "deepseek-zig");
|
|
|
|
// Create HF client
|
|
var hf_client = try HuggingFaceClient.init(self.allocator, cache_dir);
|
|
defer hf_client.deinit();
|
|
|
|
// Download the model
|
|
const model_path = try hf_client.downloadModel(model_id);
|
|
|
|
// Load the downloaded model
|
|
return try self.loadModel(model_path, model_args);
|
|
}
|
|
|
|
// Infer model arguments if not explicitly provided
|
|
fn determineModelArgs(
|
|
self: *Self,
|
|
model_args: ?ModelArgs,
|
|
header: std.json.Value,
|
|
) !ModelArgs {
|
|
if (model_args) |args| {
|
|
return args;
|
|
}
|
|
|
|
// Try to infer model configuration from the weight shapes
|
|
if (header.Object.get("metadata")) |metadata| {
|
|
if (metadata.Object.get("model_type")) |model_type| {
|
|
if (std.mem.eql(u8, model_type.String, "deepseek")) {
|
|
// Extract dimensions from metadata
|
|
return try self.parseDeepSeekConfig(metadata);
|
|
}
|
|
}
|
|
}
|
|
|
|
// Infer from weight shapes if metadata is not available
|
|
return try self.inferArgsFromWeights(header);
|
|
}
|
|
|
|
// ... more implementation details ...
|
|
};
|
|
|
|
// Implementation of TransformerModel
|
|
pub const TransformerModel = struct {
|
|
const Self = @This();
|
|
|
|
allocator: std.mem.Allocator,
|
|
args: ModelArgs,
|
|
|
|
// Tokenizer for text processing
|
|
tokenizer: *Tokenizer,
|
|
|
|
// Model components
|
|
embedding: *Embedding,
|
|
layers: []TransformerLayer,
|
|
norm: *LayerNorm,
|
|
lm_head: *Linear,
|
|
|
|
// KV cache for efficient inference
|
|
kv_cache: ?*KVCache,
|
|
|
|
// Backend for computation
|
|
backend: *ComputeBackend,
|
|
|
|
// Create a model with the given configuration
|
|
pub fn create(
|
|
allocator: std.mem.Allocator,
|
|
args: ModelArgs,
|
|
) !*Self {
|
|
// Create model components
|
|
var embedding = try Embedding.create(allocator, args);
|
|
errdefer embedding.destroy();
|
|
|
|
var layers = try allocator.alloc(TransformerLayer, args.num_layers);
|
|
errdefer allocator.free(layers);
|
|
|
|
for (layers, 0..) |*layer, i| {
|
|
layer.* = try TransformerLayer.create(allocator, args, i);
|
|
}
|
|
|
|
var norm = try LayerNorm.create(allocator, args.dim);
|
|
errdefer norm.destroy();
|
|
|
|
var lm_head = try Linear.create(allocator, args.dim, args.vocab_size);
|
|
errdefer lm_head.destroy();
|
|
|
|
// Initialize compute backend
|
|
var backend = try ComputeBackend.create(allocator);
|
|
errdefer backend.destroy();
|
|
|
|
// Initialize tokenizer
|
|
var tokenizer = try Tokenizer.create(allocator, args.vocab_size);
|
|
errdefer tokenizer.destroy();
|
|
|
|
// Create the model
|
|
var model = try allocator.create(Self);
|
|
errdefer allocator.destroy(model);
|
|
|
|
model.* = .{
|
|
.allocator = allocator,
|
|
.args = args,
|
|
.tokenizer = tokenizer,
|
|
.embedding = embedding,
|
|
.layers = layers,
|
|
.norm = norm,
|
|
.lm_head = lm_head,
|
|
.kv_cache = null,
|
|
.backend = backend,
|
|
};
|
|
|
|
return model;
|
|
}
|
|
|
|
// Clean up resources
|
|
pub fn destroy(self: *Self) void {
|
|
// Free all components
|
|
self.tokenizer.destroy();
|
|
self.embedding.destroy();
|
|
|
|
for (self.layers) |*layer| {
|
|
layer.deinit();
|
|
}
|
|
self.allocator.free(self.layers);
|
|
|
|
self.norm.destroy();
|
|
self.lm_head.destroy();
|
|
|
|
if (self.kv_cache) |kv_cache| {
|
|
kv_cache.destroy();
|
|
}
|
|
|
|
self.backend.destroy();
|
|
self.allocator.destroy(self);
|
|
}
|
|
|
|
// Load a model from a specific path
|
|
pub fn loadFromPath(
|
|
allocator: std.mem.Allocator,
|
|
path: []const u8,
|
|
args: ?ModelArgs,
|
|
) !*Self {
|
|
var loader = ModelLoader.init(allocator, .{});
|
|
return try loader.loadModel(path, args);
|
|
}
|
|
|
|
// Forward pass for a single token
|
|
pub fn forward(
|
|
self: *Self,
|
|
token_id: usize,
|
|
position: usize,
|
|
) !Tensor(f32, 2) {
|
|
// Get the token embedding
|
|
var x = try self.embedding.forward(token_id);
|
|
|
|
// Process through all transformer layers
|
|
for (self.layers) |*layer| {
|
|
x = try layer.forward(x, position, self.kv_cache);
|
|
}
|
|
|
|
// Apply final layer norm
|
|
x = try self.norm.forward(x);
|
|
|
|
// Project to vocabulary
|
|
return try self.lm_head.forward(x);
|
|
}
|
|
|
|
// Prepare the model for generation
|
|
pub fn prepareForGeneration(
|
|
self: *Self,
|
|
max_seq_len: usize,
|
|
batch_size: usize,
|
|
) !void {
|
|
// Create KV cache if not already created
|
|
if (self.kv_cache == null) {
|
|
self.kv_cache = try KVCache.create(
|
|
self.allocator,
|
|
self.args,
|
|
max_seq_len,
|
|
batch_size,
|
|
);
|
|
} else {
|
|
// Reset the cache if it already exists
|
|
try self.kv_cache.?.reset(max_seq_len, batch_size);
|
|
}
|
|
}
|
|
|
|
// Load tokenizer from vocabulary file
|
|
pub fn loadTokenizer(
|
|
self: *Self,
|
|
path: []const u8,
|
|
) !void {
|
|
try self.tokenizer.loadFromFile(path);
|
|
}
|
|
};
|
|
```
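
Putting the loader and model types together, a hypothetical load path might look like the following; the file paths are placeholders, and `ModelLoader`/`TransformerModel` are the types proposed above:

```zig
const std = @import("std");

pub fn loadExample(allocator: std.mem.Allocator) !*TransformerModel {
    var loader = ModelLoader.init(allocator, .{
        .safetensors_memory_map = true,
        .validation = .basic,
        .target_device = .Metal,
    });

    // The format is picked from the file extension; a bare identifier would
    // instead be treated as a Hugging Face model ID and downloaded.
    const model = try loader.loadModel("weights/deepseek-v3.safetensors", null);
    errdefer model.destroy();

    try model.loadTokenizer("weights/tokenizer.json");
    return model;
}
```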
|
|
|
|
#### 4.2 Generation Strategies
|
|
|
|
```zig
|
|
// Configuration for text generation
|
|
pub const GenerationConfig = struct {
|
|
// Maximum new tokens to generate
|
|
max_new_tokens: usize = 128,
|
|
|
|
// Sampling temperature (higher = more random)
|
|
temperature: f32 = 1.0,
|
|
|
|
// Top-p sampling parameter (0.0-1.0)
|
|
top_p: f32 = 1.0,
|
|
|
|
// Top-k sampling parameter (0 = disabled)
|
|
top_k: usize = 0,
|
|
|
|
// Repetition penalty to prevent looping
|
|
repetition_penalty: f32 = 1.0,
|
|
|
|
// Whether to use sampling or greedy decoding
|
|
do_sample: bool = true,
|
|
|
|
// Frequency penalty for repeated tokens
|
|
frequency_penalty: f32 = 0.0,
|
|
|
|
// Presence penalty for token occurrence
|
|
presence_penalty: f32 = 0.0,
|
|
|
|
// Stop sequences to terminate generation
|
|
stop_sequences: ?[]const []const u8 = null,
|
|
|
|
// Minimum number of tokens to generate
|
|
min_new_tokens: ?usize = null,
|
|
|
|
// Beam search width (1 = greedy)
|
|
num_beams: usize = 1,
|
|
|
|
// Random seed for reproducibility
|
|
seed: ?u64 = null,
|
|
|
|
// Whether to use speculative decoding
|
|
use_speculative: bool = false,
|
|
|
|
// Draft model for speculative decoding
|
|
draft_model: ?*TransformerModel = null,
|
|
|
|
// Number of speculative tokens to generate at once
|
|
speculative_tokens: usize = 5,
|
|
};
|
|
|
|
// Generate text from a model given input tokens
|
|
pub fn generate(
|
|
model: *TransformerModel,
|
|
input_ids: []const usize,
|
|
config: GenerationConfig,
|
|
callback: ?fn ([]const u8) void,
|
|
) ![]usize {
|
|
// Initialize RNG with seed if provided
|
|
var rng = if (config.seed) |seed|
|
|
std.rand.DefaultPrng.init(seed)
|
|
else
|
|
std.rand.DefaultPrng.init(@bitCast(u64, std.time.milliTimestamp()));
|
|
|
|
// Allocate result buffer
|
|
var result = try model.allocator.alloc(
|
|
usize,
|
|
input_ids.len + config.max_new_tokens,
|
|
);
|
|
errdefer model.allocator.free(result);
|
|
|
|
// Copy input tokens
|
|
@memcpy(result[0..input_ids.len], input_ids);
|
|
var token_count = input_ids.len;
|
|
|
|
// Prepare model for generation
|
|
try model.prepareForGeneration(
|
|
input_ids.len + config.max_new_tokens,
|
|
1, // Batch size
|
|
);
|
|
|
|
// Prefill the KV cache with all but the last prompt token; the last one is fed on the first generation step below so it is not processed twice (assumes at least one prompt token)
|
|
var position: usize = 0;
|
|
for (input_ids[0 .. input_ids.len - 1]) |token_id| {
|
|
_ = try model.forward(token_id, position);
|
|
position += 1;
|
|
}
|
|
|
|
// Check if we should use speculative decoding
|
|
if (config.use_speculative and config.draft_model != null) {
|
|
return try speculativeGenerate(
|
|
model,
|
|
config.draft_model.?,
|
|
result,
|
|
token_count,
|
|
position,
|
|
config,
|
|
callback,
|
|
);
|
|
}
|
|
|
|
// Set up logit processors based on config
|
|
var logit_processors = LogitProcessorList.init(model.allocator);
|
|
defer logit_processors.deinit();
|
|
|
|
if (config.temperature != 1.0) {
|
|
try logit_processors.add(TemperatureLogitProcessor.init(config.temperature));
|
|
}
|
|
|
|
if (config.repetition_penalty != 1.0) {
|
|
try logit_processors.add(RepetitionPenaltyLogitProcessor.init(
|
|
config.repetition_penalty,
|
|
result[0..token_count],
|
|
));
|
|
}
|
|
|
|
if (config.frequency_penalty != 0.0 or config.presence_penalty != 0.0) {
|
|
try logit_processors.add(FrequencyPenaltyLogitProcessor.init(
|
|
config.frequency_penalty,
|
|
config.presence_penalty,
|
|
));
|
|
}
|
|
|
|
// Main generation loop
|
|
while (token_count < result.len) {
|
|
// Get next token logits
|
|
var logits = try model.forward(result[token_count - 1], position);
|
|
defer logits.deinit();
|
|
|
|
// Apply logit processors
|
|
try logit_processors.process(&logits, result[0..token_count]);
|
|
|
|
// Sample next token
|
|
const next_token = if (config.do_sample)
|
|
try sampleNextToken(
|
|
model.allocator,
|
|
logits,
|
|
config.top_p,
|
|
config.top_k,
|
|
&rng.random(),
|
|
)
|
|
else
|
|
try greedyNextToken(logits);
|
|
|
|
// Add token to result
|
|
result[token_count] = next_token;
|
|
token_count += 1;
|
|
position += 1;
|
|
|
|
// Check for stop sequences
|
|
if (config.stop_sequences) |stop_seqs| {
|
|
if (checkStopSequences(
|
|
model.tokenizer,
|
|
result[0..token_count],
|
|
stop_seqs,
|
|
)) {
|
|
break;
|
|
}
|
|
}
|
|
|
|
// Call callback with generated token if provided
|
|
if (callback != null) {
|
|
var token_text = try model.tokenizer.decodeTokens(
|
|
model.allocator,
|
|
result[token_count-1..token_count],
|
|
);
|
|
defer model.allocator.free(token_text);
|
|
|
|
callback.?(token_text);
|
|
}
|
|
|
|
// Check if we've reached minimum token count
|
|
if (config.min_new_tokens) |min_tokens| {
|
|
if (token_count >= input_ids.len + min_tokens) {
|
|
// Check if we're at an EOS token
|
|
if (next_token == model.tokenizer.eos_token_id) {
|
|
break;
|
|
}
|
|
}
|
|
} else if (next_token == model.tokenizer.eos_token_id) {
|
|
// Otherwise just stop at EOS
|
|
break;
|
|
}
|
|
}
|
|
|
|
// Resize result to actual number of tokens
|
|
result = try model.allocator.realloc(result, token_count);
|
|
return result;
|
|
}
|
|
|
|
// Speculative decoding implementation
|
|
fn speculativeGenerate(
|
|
model: *TransformerModel,
|
|
draft_model: *TransformerModel,
|
|
result: []usize,
|
|
token_count: usize,
|
|
position: usize,
|
|
config: GenerationConfig,
|
|
callback: ?fn ([]const u8) void,
|
|
) ![]usize {
|
|
// Implementation of speculative decoding algorithm
|
|
// This generates multiple tokens using a smaller draft model
|
|
// and verifies them with the main model for faster generation
|
|
|
|
// ... implementation details ...
|
|
return result;
|
|
}
|
|
|
|
// Sample next token using top-p (nucleus) and top-k sampling
|
|
fn sampleNextToken(
|
|
allocator: std.mem.Allocator,
|
|
logits: Tensor(f32, 2),
|
|
top_p: f32,
|
|
top_k: usize,
|
|
random: *std.rand.Random,
|
|
) !usize {
|
|
const vocab_size = logits.shape[1];
|
|
|
|
// Create a sorted list of (token_id, probability) pairs
|
|
var token_probs = try allocator.alloc(
|
|
struct { token_id: usize, prob: f32 },
|
|
vocab_size,
|
|
);
|
|
defer allocator.free(token_probs);
|
|
|
|
// Apply softmax to get probabilities
|
|
var probs = try softmax(allocator, logits);
|
|
defer probs.deinit();
|
|
|
|
// Fill token_probs array
|
|
for (0..vocab_size) |i| {
|
|
token_probs[i] = .{
|
|
.token_id = i,
|
|
.prob = probs.data[i],
|
|
};
|
|
}
|
|
|
|
// Sort by probability (descending)
|
|
std.sort.sort(
|
|
struct { token_id: usize, prob: f32 },
|
|
token_probs,
|
|
{},
|
|
struct {
|
|
fn lessThan(_: void, a: struct { token_id: usize, prob: f32 }, b: struct { token_id: usize, prob: f32 }) bool {
|
|
return b.prob < a.prob;
|
|
}
|
|
}.lessThan,
|
|
);
|
|
|
|
// Apply top-k filtering if enabled
|
|
const k = if (top_k > 0)
|
|
@min(top_k, vocab_size)
|
|
else
|
|
vocab_size;
|
|
|
|
// Apply top-p filtering
|
|
var cumulative_prob: f32 = 0.0;
|
|
var last_idx: usize = k - 1; // default to the full top-k set if the cumulative mass never reaches top_p
|
|
|
|
for (token_probs[0..k], 0..) |tp, i| {
|
|
cumulative_prob += tp.prob;
|
|
if (cumulative_prob >= top_p) {
|
|
last_idx = i;
|
|
break;
|
|
}
|
|
}
|
|
|
|
// Sample from the filtered distribution
|
|
const rand_val = random.float(f32);
|
|
var curr_prob: f32 = 0.0;
|
|
|
|
for (token_probs[0..last_idx+1]) |tp| {
|
|
curr_prob += tp.prob;
|
|
if (rand_val < curr_prob) {
|
|
return tp.token_id;
|
|
}
|
|
}
|
|
|
|
// Fallback to the highest probability token
|
|
return token_probs[0].token_id;
|
|
}
|
|
```

**Advanced Features:**

1. **Speculative Decoding**
   - Implementation of speculative decoding using a smaller draft model
   - Verification and acceptance/rejection of speculated tokens
   - Significant speedup in generation throughput

2. **Streaming Token Output**
   - Callback-based token streaming for real-time results
   - Zero-copy token decoding for minimal overhead
   - Support for incremental UI updates

3. **Custom Sampling Strategies** (see the sketch after this list)
   - Top-p (nucleus) sampling with dynamic probability mass cutoff
   - Top-k sampling with configurable k value
   - Temperature scaling for controlling randomness
   - Repetition penalty to prevent loops and repetitive text
   - Frequency and presence penalties for more diverse output

4. **Stop Sequence Detection**
   - Efficient detection of multiple stop sequences
   - Support for subword token matching across boundaries
   - Early termination based on generated content

5. **Beam Search Implementation**
   - Configurable beam width for exploring multiple generation paths
   - Length normalization for balancing short and long outputs
   - Diverse beam groups to prevent similar outputs

6. **Memory Efficiency**
   - KV-cache memory management for long context handling
   - Incremental cache updates for streaming inference
   - Automatic cache pruning for memory optimization

7. **Performance Optimizations**
   - Batched token processing for higher throughput
   - Parallel sampling for multi-sequence generation
   - SIMD-accelerated logit processing
   - Compile-time specialization for common configuration patterns

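The temperature and repetition-penalty processors under item 3 are simple element-wise passes over the logits. Below is a minimal sketch, assuming `logits.data` exposes the vocab-sized row of f32 scores used by the generation loop above; the exact `LogitProcessors` interface is left open here.

```zig
// Minimal sketch of two logit processors. Assumes `logits.data` holds a
// vocab-sized row of f32 scores, matching the generation loop above.
fn applyTemperature(logits: *Tensor(f32, 2), temperature: f32) void {
    if (temperature <= 0.0 or temperature == 1.0) return;
    for (logits.data) |*score| {
        score.* /= temperature;
    }
}

fn applyRepetitionPenalty(logits: *Tensor(f32, 2), generated: []const usize, penalty: f32) void {
    if (penalty == 1.0) return;
    for (generated) |token_id| {
        const score = logits.data[token_id];
        // Shrink positive scores and amplify negative ones so that tokens
        // which already appeared become less likely to be sampled again.
        logits.data[token_id] = if (score > 0.0) score / penalty else score * penalty;
    }
}
```
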
### 5. Optimization Layer

The optimization layer leverages Zig's unique features to maximize performance across different hardware targets.

#### 5.1 Compile-Time Optimizations

Zig's powerful compile-time metaprogramming enables us to generate highly specialized code for specific hardware and model configurations:

```zig
// Specialized matrix multiplication kernels generated at compile-time
pub fn generateMatmulKernel(comptime config: KernelConfig) type {
    return struct {
        const Self = @This();

        // Compile-time configuration
        const M = config.M;
        const N = config.N;
        const K = config.K;
        const block_size = config.block_size;
        const vector_width = config.vector_width;
        const use_fma = config.use_fma;

        // Vector type based on configuration
        const Vec = @Vector(vector_width, f32);

        // Matmul implementation specialized for the given dimensions
        pub fn matmul(
            a: *const [M][K]f32,
            b: *const [K][N]f32,
            c: *[M][N]f32,
        ) void {
            // Use specialized implementation for small matrices
            if (comptime M <= 4 and N <= 4 and K <= 4) {
                return smallMatmul(a, b, c);
            }

            // Use blocked implementation for larger matrices
            return blockedMatmul(a, b, c);
        }

        // Specialized implementation for small matrices
        // Fully unrolled at compile time
        fn smallMatmul(
            a: *const [M][K]f32,
            b: *const [K][N]f32,
            c: *[M][N]f32,
        ) void {
            inline for (0..M) |i| {
                inline for (0..N) |j| {
                    var sum: f32 = 0;
                    inline for (0..K) |k| {
                        sum += a[i][k] * b[k][j];
                    }
                    c[i][j] = sum;
                }
            }
        }

        // Cache-blocked implementation for larger matrices
        fn blockedMatmul(
            a: *const [M][K]f32,
            b: *const [K][N]f32,
            c: *[M][N]f32,
        ) void {
            // Compute using blocks for better cache utilization
            comptime var i_block: usize = 0;
            inline while (i_block < M) : (i_block += block_size) {
                comptime var j_block: usize = 0;
                inline while (j_block < N) : (j_block += block_size) {
                    comptime var k_block: usize = 0;
                    inline while (k_block < K) : (k_block += block_size) {
                        const i_end = @min(i_block + block_size, M);
                        const j_end = @min(j_block + block_size, N);
                        const k_end = @min(k_block + block_size, K);

                        // Process current block
                        for (i_block..i_end) |i| {
                            for (j_block..j_end) |j| {
                                var sum: f32 = c[i][j];

                                // Vectorized inner loop when possible
                                if (comptime vector_width > 1 and (k_end - k_block) >= vector_width) {
                                    var k_vec: usize = k_block;
                                    var acc: Vec = @splat(0.0);

                                    while (k_vec + vector_width <= k_end) : (k_vec += vector_width) {
                                        const a_vec: Vec = blk: {
                                            var tmp: [vector_width]f32 = undefined;
                                            for (0..vector_width) |vi| {
                                                tmp[vi] = a[i][k_vec + vi];
                                            }
                                            break :blk tmp;
                                        };

                                        const b_vec: Vec = blk: {
                                            var tmp: [vector_width]f32 = undefined;
                                            for (0..vector_width) |vi| {
                                                tmp[vi] = b[k_vec + vi][j];
                                            }
                                            break :blk tmp;
                                        };

                                        // Use FMA instruction if available
                                        if (comptime use_fma) {
                                            acc = @mulAdd(Vec, a_vec, b_vec, acc);
                                        } else {
                                            acc += a_vec * b_vec;
                                        }
                                    }

                                    // Reduce vector to scalar
                                    for (0..vector_width) |vi| {
                                        sum += acc[vi];
                                    }

                                    // Handle remaining elements
                                    for (k_vec..k_end) |k| {
                                        sum += a[i][k] * b[k][j];
                                    }
                                } else {
                                    // Scalar fallback
                                    for (k_block..k_end) |k| {
                                        sum += a[i][k] * b[k][j];
                                    }
                                }

                                c[i][j] = sum;
                            }
                        }
                    }
                }
            }
        }
    };
}

// Configuration for kernel generation
pub const KernelConfig = struct {
    // Matrix dimensions (known at compile time)
    M: comptime_int,
    N: comptime_int,
    K: comptime_int,

    // Blocking configuration for cache optimization
    block_size: comptime_int = 32,

    // Vector width for SIMD operations
    vector_width: comptime_int = 4,

    // Whether to use FMA instructions when available
    use_fma: bool = true,
};

// Usage: Create specialized kernels at compile time
// Fully unrolled 4x4 matrix multiplication
const Kernel4x4 = generateMatmulKernel(.{
    .M = 4,
    .N = 4,
    .K = 4,
    .vector_width = 4,
});

// Cache-friendly 128x128 matrix multiplication
const Kernel128x128 = generateMatmulKernel(.{
    .M = 128,
    .N = 128,
    .K = 128,
    .block_size = 32,
    .vector_width = 8,
});

// Runtime dispatch to select the best kernel based on matrix dimensions
pub fn dispatchMatmul(
    allocator: std.mem.Allocator,
    a: Tensor(f32, 2),
    b: Tensor(f32, 2),
) !Tensor(f32, 2) {
    // Check dimensions
    const m = a.shape[0];
    const k = a.shape[1];
    const n = b.shape[1];

    // Inner dimensions must agree
    std.debug.assert(k == b.shape[0]);

    // Create result tensor
    var result = try Tensor(f32, 2).init(allocator, .{ m, n });
    errdefer result.deinit();

    // Initialize result to zeros
    @memset(result.data, 0);

    // Dispatch to specialized kernels if dimensions match exactly
    if (m == 4 and n == 4 and k == 4) {
        // Use specialized 4x4 kernel
        Kernel4x4.matmul(
            @ptrCast(*const [4][4]f32, a.data.ptr),
            @ptrCast(*const [4][4]f32, b.data.ptr),
            @ptrCast(*[4][4]f32, result.data.ptr),
        );
    } else if (m == 128 and n == 128 and k == 128) {
        // Use specialized 128x128 kernel
        Kernel128x128.matmul(
            @ptrCast(*const [128][128]f32, a.data.ptr),
            @ptrCast(*const [128][128]f32, b.data.ptr),
            @ptrCast(*[128][128]f32, result.data.ptr),
        );
    } else {
        // Use generic implementation for arbitrary dimensions
        try genericMatmul(a, b, &result);
    }

    return result;
}

// Apply compile-time metaprogramming to optimize data layouts
pub fn optimizedTensorLayout(comptime T: type, comptime dims: []const usize) type {
    return struct {
        const Self = @This();

        // Determine optimal memory layout at compile time
        const optimal_layout = optimizeMemoryLayout(T, dims);

        // Data storage allocated with the optimized alignment
        data: []align(optimal_layout.alignment) T,
        shape: [dims.len]usize,
        strides: [dims.len]usize,

        // Tensor initialization with optimal layout
        pub fn init(allocator: std.mem.Allocator) !Self {
            const data = try allocator.alignedAlloc(
                T,
                optimal_layout.alignment,
                product(dims),
            );

            // Calculate optimal strides based on layout
            var strides: [dims.len]usize = undefined;
            if (optimal_layout.row_major) {
                // Row-major strides
                var stride: usize = 1;
                var i: usize = dims.len;
                while (i > 0) {
                    i -= 1;
                    strides[i] = stride;
                    stride *= dims[i];
                }
            } else {
                // Column-major strides
                var stride: usize = 1;
                for (0..dims.len) |i| {
                    strides[i] = stride;
                    stride *= dims[i];
                }
            }

            return Self{
                .data = data,
                .shape = dims[0..dims.len].*,
                .strides = strides,
            };
        }

        // Helper function to calculate optimal memory layout
        fn optimizeMemoryLayout(comptime ElemType: type, comptime shape_dims: []const usize) struct {
            row_major: bool,
            alignment: u29,
        } {
            // Use column-major for matrices where the first dimension is much larger
            // This often improves cache locality for common access patterns
            const row_major = if (shape_dims.len == 2)
                shape_dims[0] <= shape_dims[1] * 2
            else
                true;

            // Determine optimal alignment based on vector units
            const alignment = if (@sizeOf(ElemType) == 4 and comptime std.Target.current.cpu.arch == .x86_64)
                if (comptime std.Target.current.cpu.features.isEnabled(.avx512f))
                    64 // 512-bit alignment for AVX-512
                else if (comptime std.Target.current.cpu.features.isEnabled(.avx2))
                    32 // 256-bit alignment for AVX2
                else if (comptime std.Target.current.cpu.features.isEnabled(.sse2))
                    16 // 128-bit alignment for SSE2
                else
                    @alignOf(ElemType)
            else
                @alignOf(ElemType);

            return .{
                .row_major = row_major,
                .alignment = alignment,
            };
        }

        // Helper to calculate the product of dimensions
        fn product(comptime shape_dims: []const usize) usize {
            var result: usize = 1;
            for (shape_dims) |dim| {
                result *= dim;
            }
            return result;
        }
    };
}
```
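A call site for `dispatchMatmul` might look like the following; `a` and `b` here are hypothetical 2-D tensors, and ownership of the result follows the same allocator/`deinit` pattern used throughout this design.

```zig
// Hypothetical usage: the 4x4 and 128x128 shapes hit the comptime-generated
// kernels above; everything else falls back to genericMatmul.
var c = try dispatchMatmul(allocator, a, b);
defer c.deinit();
```
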

**Key Compile-Time Techniques:**

1. **Matrix Operation Specialization**
   - Specialized kernels generated at compile-time for common dimensions
   - Full loop unrolling for small matrices
   - Compile-time configurable blocking strategies for cache optimization

2. **Data Layout Optimization**
   - Automatic selection of row-major or column-major layout based on dimensions
   - Optimal memory alignment for target architecture's vector units
   - Compile-time stride calculation for fast indexing

3. **Architecture-Specific Optimizations**
   - Vector width specialization based on target CPU features
   - Automatic use of FMA instructions when available
   - SIMD instruction generation tailored to the target architecture

4. **Kernel Selection**
   - Runtime dispatch to specialized kernels based on input dimensions
   - Fallback to generic implementation for arbitrary dimensions
   - Compile-time branch elimination for performance-critical paths

#### 5.2 Quantization Framework

Our quantization framework allows for efficient low-precision inference while maintaining accuracy:

```zig
// Quantization configuration
pub const QuantizationConfig = struct {
    // Precision of quantized values
    bits: u8 = 8,

    // Quantization scheme
    scheme: enum {
        symmetric, // Zero-point is fixed at mid-range, simplifies arithmetic
        asymmetric, // Allows representing the full range more precisely
    } = .symmetric,

    // Quantization granularity
    granularity: enum {
        per_tensor, // One scale for the entire tensor
        per_channel, // Different scale for each output channel
    } = .per_tensor,

    // Whether to use integer or float16 quantization
    use_float16: bool = false,

    // Calibration strategy
    calibration: enum {
        minmax, // Simple min/max scaling
        entropy, // Entropy-based quantization
        percentile, // Clip to percentile range for outliers
    } = .minmax,

    // Percentile value for calibration (0.0-1.0)
    percentile: f32 = 0.99995,
};

// Quantized tensor type that tracks quantization parameters
pub fn QuantizedTensor(comptime original_type: type, comptime bits: u8) type {
    return struct {
        const Self = @This();

        // Determine the appropriate integer type based on bit width
        const IntType = std.meta.Int(.unsigned, bits);

        // Original element type for reference
        pub const OriginalType = original_type;

        // Quantized data
        data: []IntType,

        // Original tensor shape
        shape: []const usize,

        // Quantization parameters
        scale: []f32,
        zero_point: []IntType,

        // Whether scale/zero_point are per-tensor or per-channel
        per_channel: bool,

        // Minimum representable quantized value
        qmin: IntType,

        // Maximum representable quantized value
        qmax: IntType,

        // Channel dimension for per-channel quantization
        channel_dim: ?usize,

        // Memory allocator for cleanup
        allocator: std.mem.Allocator,

        // Initialize a quantized tensor
        pub fn init(
            allocator: std.mem.Allocator,
            shape: []const usize,
            per_channel: bool,
            channel_dim: ?usize,
        ) !Self {
            // Calculate total size
            var total_size: usize = 1;
            for (shape) |dim| {
                total_size *= dim;
            }

            // Determine number of scales/zero_points needed
            const param_size = if (per_channel)
                shape[channel_dim.?]
            else
                1;

            // Allocate memory
            const data = try allocator.alloc(IntType, total_size);
            errdefer allocator.free(data);

            const scale = try allocator.alloc(f32, param_size);
            errdefer allocator.free(scale);

            const zero_point = try allocator.alloc(IntType, param_size);
            errdefer allocator.free(zero_point);

            // Calculate quantization range
            const qmin: IntType = 0;
            const qmax: IntType = (1 << bits) - 1;

            // Create shape copy
            const shape_copy = try allocator.dupe(usize, shape);
            errdefer allocator.free(shape_copy);

            return Self{
                .data = data,
                .shape = shape_copy,
                .scale = scale,
                .zero_point = zero_point,
                .per_channel = per_channel,
                .qmin = qmin,
                .qmax = qmax,
                .channel_dim = channel_dim,
                .allocator = allocator,
            };
        }

        // Free allocated memory
        pub fn deinit(self: *Self) void {
            self.allocator.free(self.data);
            self.allocator.free(self.scale);
            self.allocator.free(self.zero_point);
            self.allocator.free(self.shape);
        }
    };
}

// Quantize a floating-point tensor to integer precision
pub fn quantize(
    tensor: anytype,
    comptime config: QuantizationConfig,
    allocator: std.mem.Allocator,
) !QuantizedTensor(
    @TypeOf(tensor.data[0]),
    config.bits,
) {
    const T = @TypeOf(tensor.data[0]);

    // Validate input
    if (config.bits > 16) {
        return error.UnsupportedQuantizationBits;
    }

    if (config.granularity == .per_channel and config.calibration != .minmax) {
        return error.UnsupportedCombination;
    }

    // Create quantized tensor
    var channel_dim: ?usize = null;
    if (config.granularity == .per_channel) {
        // For per-channel quantization, use dimension 0 for vectors,
        // dimension 1 for matrices (assuming CHW layout)
        channel_dim = if (tensor.shape.len == 1) 0 else 1;
    }

    var qtensor = try QuantizedTensor(T, config.bits).init(
        allocator,
        tensor.shape,
        config.granularity == .per_channel,
        channel_dim,
    );
    errdefer qtensor.deinit();

    // Different calibration strategies
    switch (config.calibration) {
        .minmax => try calibrateMinMax(&qtensor, tensor, config),
        .entropy => try calibrateEntropy(&qtensor, tensor, config),
        .percentile => try calibratePercentile(&qtensor, tensor, config),
    }

    // Perform actual quantization
    try quantizeTensor(&qtensor, tensor, config);

    return qtensor;
}

// Dequantize a tensor back to floating point
pub fn dequantize(
    qtensor: anytype,
    allocator: std.mem.Allocator,
) !Tensor(@TypeOf(qtensor).OriginalType, qtensor.shape.len) {
    const T = @TypeOf(qtensor).OriginalType;

    // Create tensor to hold dequantized values
    var tensor = try Tensor(T, qtensor.shape.len).init(
        allocator,
        qtensor.shape,
    );
    errdefer tensor.deinit();

    // Dequantize values
    if (qtensor.per_channel) {
        const channel_dim = qtensor.channel_dim.?;
        const channels = qtensor.shape[channel_dim];

        // Calculate strides for traversing channels
        var strides: []usize = try allocator.alloc(usize, qtensor.shape.len);
        defer allocator.free(strides);

        var stride: usize = 1;
        var i: usize = qtensor.shape.len;
        while (i > 0) {
            i -= 1;
            strides[i] = stride;
            stride *= qtensor.shape[i];
        }

        // Dequantize each element based on its channel
        for (0..tensor.data.len) |idx| {
            const channel_idx = (idx / strides[channel_dim]) % channels;
            const scale = qtensor.scale[channel_idx];
            const zero_point = qtensor.zero_point[channel_idx];

            tensor.data[idx] = @floatCast(T,
                (@intToFloat(f32, qtensor.data[idx]) - @intToFloat(f32, zero_point)) * scale
            );
        }
    } else {
        // Per-tensor dequantization (simpler)
        const scale = qtensor.scale[0];
        const zero_point = qtensor.zero_point[0];

        for (0..tensor.data.len) |i| {
            tensor.data[i] = @floatCast(T,
                (@intToFloat(f32, qtensor.data[i]) - @intToFloat(f32, zero_point)) * scale
            );
        }
    }

    return tensor;
}

// Calibrate using simple min/max strategy
fn calibrateMinMax(
    qtensor: anytype,
    tensor: anytype,
    config: QuantizationConfig,
) !void {
    if (config.granularity == .per_tensor) {
        // Find min/max across entire tensor
        var min_val: f32 = std.math.inf_f32;
        var max_val: f32 = -std.math.inf_f32;

        for (tensor.data) |val| {
            const fval = @floatCast(f32, val);
            min_val = @min(min_val, fval);
            max_val = @max(max_val, fval);
        }

        // Handle symmetric quantization
        if (config.scheme == .symmetric) {
            const abs_max = @max(@abs(min_val), @abs(max_val));
            min_val = -abs_max;
            max_val = abs_max;
        }

        // Calculate scale and zero_point
        const range = max_val - min_val;
        qtensor.scale[0] = range / @intToFloat(f32, qtensor.qmax - qtensor.qmin);

        if (config.scheme == .symmetric) {
            qtensor.zero_point[0] = @divFloor(qtensor.qmax - qtensor.qmin, 2) + qtensor.qmin;
        } else {
            qtensor.zero_point[0] = @floatToInt(
                @TypeOf(qtensor.zero_point[0]),
                @round(@intToFloat(f32, qtensor.qmin) - min_val / qtensor.scale[0]),
            );
        }
    } else {
        // Per-channel quantization
        // ... implementation details ...
    }
}

// Perform actual quantization
fn quantizeTensor(
    qtensor: anytype,
    tensor: anytype,
    config: QuantizationConfig,
) !void {
    if (qtensor.per_channel) {
        // Per-channel quantization
        // ... implementation details ...
    } else {
        // Per-tensor quantization
        const scale = qtensor.scale[0];
        const zero_point = qtensor.zero_point[0];
        const qmin = qtensor.qmin;
        const qmax = qtensor.qmax;

        for (0..tensor.data.len) |i| {
            const val = @floatCast(f32, tensor.data[i]);

            // Quantize: x_q = round(x / scale) + zero_point
            const unclamped = @round(val / scale) + @intToFloat(f32, zero_point);

            // Clamp to the quantization range before converting to the integer type
            const clamped = @max(@min(unclamped, @intToFloat(f32, qmax)), @intToFloat(f32, qmin));

            qtensor.data[i] = @floatToInt(@TypeOf(qtensor.data[0]), clamped);
        }
    }
}
```
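To make the min/max calibration path concrete: for a weight tensor whose observed values span roughly [-2.0, 3.0], symmetric 8-bit calibration widens the range to [-3.0, 3.0], giving `scale = 6.0 / 255 ≈ 0.0235` and a mid-range `zero_point` of 127. A value of 1.5 then quantizes to `round(1.5 / 0.0235) + 127 = 64 + 127 = 191` and dequantizes back to `(191 - 127) × 0.0235 ≈ 1.51`, a round-trip error well under 1% for this value.
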

**Quantization Features:**

1. **Multiple Precision Options**
   - 8-bit quantization for maximum throughput
   - 4-bit quantization for model compression
   - 3-bit quantization for extreme size reduction
   - FP16 quantization for memory bandwidth reduction with minimal accuracy loss

2. **Flexible Quantization Schemes**
   - Symmetric quantization for simpler arithmetic
   - Asymmetric quantization for better range utilization
   - Per-tensor quantization for speed
   - Per-channel quantization for accuracy

3. **Advanced Calibration Methods**
   - Min/max calibration for simplicity
   - Entropy-based calibration for better distribution representation
   - Percentile-based calibration for outlier handling

4. **Mixed-Precision Execution**
   - Critical layers in higher precision for accuracy
   - Non-critical layers in lower precision for speed
   - Automatic precision selection based on sensitivity analysis

5. **Hardware Acceleration**
   - Optimized integer SIMD operations for quantized execution
   - Specialized kernels for common quantized operations
   - Hardware-specific optimizations for quantized compute

## Platform-Specific Optimizations

### Apple Silicon (M-Series)

The DeepSeek V3 Zig implementation is highly optimized for Apple Silicon's unique architecture:

1. **Metal Performance Shaders (MPS) Integration**
   - Direct integration with Apple's Metal Performance Shaders for matrix operations
   - Custom Metal compute kernels optimized for M-series chips
   - Efficient memory sharing between CPU and GPU with zero-copy transfers

2. **Matrix Unit Utilization**
   - Leveraging the matrix multiplication units in M-series chips
   - Mixed-precision operations optimized for Apple Silicon
   - Native FP16 support for improved throughput

3. **AMX Instruction Set Access** (a minimal FFI sketch follows this list)
   - Direct use of Apple Matrix Extensions (AMX) for accelerated linear algebra
   - Low-level optimization of critical matrix operations
   - Custom assembly routines for maximum performance

4. **Memory Bandwidth Optimization**
   - Unified memory architecture exploitation
   - Cache-friendly memory access patterns
   - Optimal tile sizes for M-series cache hierarchy

5. **Power Efficiency Tuning**
   - Dynamic performance/power scaling
   - Efficient core utilization across P and E cores
   - Background inference optimizations

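Because Apple does not document AMX publicly, one pragmatic route to those units is Apple's Accelerate BLAS, which can execute large GEMMs on the matrix coprocessor. The sketch below is a minimal FFI binding under the assumption that the build links the Accelerate framework; the `sgemm` wrapper name and the flat row-major slices are our own conventions, and the CBLAS enums are passed as their standard integer values.

```zig
// Minimal sketch: single-precision GEMM through Apple's Accelerate framework.
// Assumes the build links Accelerate (e.g. `-framework Accelerate`).
const std = @import("std");

// Standard CBLAS constants (row-major layout, no transposition).
const CBLAS_ROW_MAJOR: c_int = 101;
const CBLAS_NO_TRANS: c_int = 111;

extern "c" fn cblas_sgemm(
    order: c_int,
    trans_a: c_int,
    trans_b: c_int,
    m: c_int,
    n: c_int,
    k: c_int,
    alpha: f32,
    a: [*]const f32,
    lda: c_int,
    b: [*]const f32,
    ldb: c_int,
    beta: f32,
    c: [*]f32,
    ldc: c_int,
) void;

/// C = A * B for row-major matrices stored as flat slices.
pub fn sgemm(m: usize, n: usize, k: usize, a: []const f32, b: []const f32, c: []f32) void {
    cblas_sgemm(
        CBLAS_ROW_MAJOR,
        CBLAS_NO_TRANS,
        CBLAS_NO_TRANS,
        @intCast(c_int, m),
        @intCast(c_int, n),
        @intCast(c_int, k),
        1.0,
        a.ptr,
        @intCast(c_int, k),
        b.ptr,
        @intCast(c_int, n),
        0.0,
        c.ptr,
        @intCast(c_int, n),
    );
}
```
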
### x86_64 Architecture

For x86_64 platforms, our implementation focuses on leveraging the latest instruction sets:

1. **AVX-512 Vectorization**
   - Full utilization of 512-bit vector operations
   - Masked operations for efficient boundary handling
   - FMA instruction usage for maximum throughput

2. **Cache-Friendly Memory Layouts**
   - Cache line aligned data structures
   - Blocked algorithms optimized for typical L1/L2/L3 cache sizes
   - Software prefetching for critical data paths

3. **Thread Pool Optimization**
   - Work-stealing scheduler for balanced multicore utilization
   - NUMA-aware memory allocation and thread assignment
   - Adaptive parallelism based on available cores

4. **Dynamic Dispatch** (sketched after this list)
   - Runtime CPU feature detection
   - Specialized code paths for different instruction sets
   - Fallback implementations for compatibility

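As a sketch of the dispatch item above: the check below queries the compile-time target's feature set (`builtin.cpu.features`), which reflects the host when building with a native CPU target; a production build would pair this baseline with a true runtime CPUID probe. The `matmulAvx512` and `matmulGeneric` kernels are placeholders.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Placeholder kernels; real implementations live in the backend layer.
fn matmulAvx512(a: []const f32, b: []const f32, c: []f32) void {
    _ = a; _ = b; _ = c; // 512-bit vectorized path (elided)
}

fn matmulGeneric(a: []const f32, b: []const f32, c: []f32) void {
    _ = a; _ = b; _ = c; // Portable fallback path (elided)
}

// Does the build target advertise AVX-512 foundation support?
fn targetHasAvx512() bool {
    return builtin.cpu.arch == .x86_64 and
        std.Target.x86.featureSetHas(builtin.cpu.features, .avx512f);
}

// Select the specialized path when available, fall back otherwise.
pub fn matmul(a: []const f32, b: []const f32, c: []f32) void {
    if (comptime targetHasAvx512()) {
        matmulAvx512(a, b, c);
    } else {
        matmulGeneric(a, b, c);
    }
}
```
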
### NVIDIA GPUs

NVIDIA GPU acceleration is implemented through an efficient CUDA integration:

1. **CUDA Integration via FFI** (a minimal binding sketch follows this list)
   - Zero-overhead bindings to the CUDA runtime
   - Asynchronous kernel execution and memory transfers
   - Efficient stream management for overlapping operations

2. **Custom CUDA Kernels**
   - Specialized kernels for attention mechanisms
   - Optimized matrix multiplication for transformer layers
   - Fused operations for reduced kernel launch overhead

3. **Memory Management**
   - Pinned memory for efficient transfers
   - Memory pool for reduced allocation overhead
   - Smart prefetching for predictable memory access patterns

4. **Tensor Core Utilization**
   - Mixed-precision operations using Tensor Cores
   - Automatic kernel selection for Tensor Core-eligible operations
   - Tensor Core-compatible memory layouts

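A minimal sketch of binding the CUDA runtime through the C ABI is shown below. It assumes the build links against `libcudart`; error handling is reduced to a success check on each call, and `uploadToDevice` is our own wrapper name.

```zig
const std = @import("std");

const cudaSuccess: c_int = 0;

// cudaMemcpyKind values from the CUDA runtime API.
const cudaMemcpyHostToDevice: c_int = 1;
const cudaMemcpyDeviceToHost: c_int = 2;

extern "c" fn cudaMalloc(dev_ptr: **anyopaque, size: usize) c_int;
extern "c" fn cudaMemcpy(dst: *anyopaque, src: *const anyopaque, count: usize, kind: c_int) c_int;
extern "c" fn cudaFree(dev_ptr: *anyopaque) c_int;

/// Copy a host f32 slice into a freshly allocated device buffer and
/// return the raw device pointer.
pub fn uploadToDevice(host: []const f32) !*anyopaque {
    const byte_len = host.len * @sizeOf(f32);

    var dev_ptr: *anyopaque = undefined;
    if (cudaMalloc(&dev_ptr, byte_len) != cudaSuccess) return error.CudaMallocFailed;
    errdefer _ = cudaFree(dev_ptr);

    if (cudaMemcpy(dev_ptr, @ptrCast(*const anyopaque, host.ptr), byte_len, cudaMemcpyHostToDevice) != cudaSuccess) {
        return error.CudaMemcpyFailed;
    }

    return dev_ptr;
}
```
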
## Development Roadmap

### Phase 1: Core Infrastructure

The initial phase focuses on establishing the foundational components:

- **Memory Management System**
  - Custom tensor allocator implementation
  - Arena-based allocation strategies
  - Error handling framework

- **Tensor Implementation**
  - Basic tensor operations and utilities
  - SIMD-accelerated implementations
  - Platform detection and optimization

- **Computation Backend Interfaces**
  - Abstract backend interfaces
  - CPU backend implementation
  - Initial Metal backend for Apple Silicon

- **Error Handling Framework**
  - Robust error propagation
  - Detailed error reporting
  - Resource cleanup guarantees

### Phase 2: Model Architecture

Building on the infrastructure, we implement the core model components:

- **Transformer Layers**
  - Multi-head attention implementation
  - Feed-forward networks
  - Layer normalization

- **Attention Mechanisms**
  - Standard attention implementation
  - Flash attention optimizations
  - Memory-efficient attention variants

- **Mixture of Experts**
  - Router implementation
  - Parallel expert execution
  - Load balancing mechanisms

- **Embedding Systems**
  - Token embeddings
  - Position embeddings
  - Rotary position embeddings

### Phase 3: Backend Integration

This phase extends compute capabilities across different hardware:

- **CPU Backend**
  - AVX-512 optimizations
  - Thread pool implementation
  - Cache-optimized algorithms

- **Metal Backend**
  - Complete Metal shader library
  - Apple Neural Engine integration
  - M-series specific optimizations

- **CUDA Backend**
  - NVIDIA GPU support
  - Tensor Core optimizations
  - Multi-GPU scaling

- **Vulkan Backend**
  - Cross-platform GPU support
  - AMD GPU optimizations
  - Intel GPU support

### Phase 4: Inference Pipeline

Creating the end-to-end inference system:

- **Model Loading**
  - SafeTensors format support
  - Checkpoint loading
  - Weight quantization

- **Tokenization**
  - Efficient tokenizer implementation
  - Streaming tokenization
  - Special token handling

- **Generation Strategies**
  - Sampling methods implementation
  - Beam search
  - Speculative decoding

- **Output Processing**
  - Token streaming
  - Stop sequence handling
  - Result formatting

### Phase 5: Optimization

Comprehensive optimization across the entire stack:

- **Compile-Time Optimizations**
  - Comptime specialization
  - Kernel generation
  - Custom data layouts

- **Runtime Optimizations**
  - Dynamic kernel selection
  - Adaptive compute strategies
  - Memory access optimizations

- **Architecture-Specific Tuning**
  - Platform-specific parameter tuning
  - Hardware-specific kernel variants
  - Feature detection and adaptation

- **Quantization Framework**
  - 8-bit quantization
  - 4-bit quantization
  - Mixed precision execution

### Phase 6: Testing and Benchmarking

Ensuring correctness and measuring performance:

- **Comprehensive Test Suite**
  - Unit tests for all components
  - Integration tests for end-to-end validation
  - Conformance tests against reference implementation

- **Benchmarking Framework**
  - Performance measurement tools
  - Comparison with PyTorch implementation
  - Memory usage analysis

- **Platform Benchmarks**
  - Apple Silicon performance
  - x86_64 performance
  - NVIDIA GPU performance

- **Fine-Tuning**
  - Performance bottleneck identification
  - Targeted optimizations
  - Final parameter tuning

## Why DeepSeek V3 in Zig?

The migration of DeepSeek V3 to Zig represents a significant step forward in language model implementation. By leveraging Zig's unique features, particularly compile-time metaprogramming and fine-grained memory control, we aim to build a highly optimized implementation that can outperform the original Python/PyTorch version while remaining flexible and easy to use.

Key advantages of the Zig implementation include:

1. **Superior Performance**
   - Compile-time specialization eliminates runtime overhead
   - Direct hardware access for maximum efficiency
   - Zero-cost abstractions for clean yet fast code

2. **Memory Efficiency**
   - Explicit allocation strategies tailored to LLM workloads
   - Reduced memory fragmentation
   - Lower overall memory footprint

3. **Reliability**
   - Comprehensive error handling
   - No runtime exceptions
   - Deterministic resource cleanup

4. **Portability**
   - Cross-platform support with optimized backends
   - Consistent behavior across environments
   - Single codebase for all target platforms

The resulting system should be particularly well-suited for deployment on resource-constrained devices while delivering strong performance on every supported platform. This architectural approach lays the foundation for future innovations in large language model deployment.