# DeepSeek V3 in Zig: `DeepZig V3` Proposal

**Language:** Zig | **License:** DeepSeek | **Status:** Proposal
**Performance:** High Efficiency | **Platform:** Cross-Platform
**Features:** SIMD Optimized | **Architecture:** MoE | **Backend:** Customizable

## Overview

This document outlines the initial architecture proposal for implementing DeepSeek V3 in the Zig programming language. The focus is on leveraging Zig's unique features to create a high-performance, memory-efficient, and robust implementation of the DeepSeek V3 architecture. The proposal targets five primary goals:

1. **Superior Performance**: Leverage Zig's compile-time metaprogramming, SIMD vectorization, and low-level control to achieve optimal performance across platforms
2. **Memory Efficiency**: Utilize Zig's explicit allocator system and arena allocation patterns for precise resource management
3. **Concurrent Processing**: Implement efficient parallel execution using Zig's async/await framework and evented I/O
4. **Type Safety & Reliability**: Employ Zig's strong type system, comptime checks, and explicit error handling to prevent runtime errors
5. **Cross-Platform Support**: Create a portable implementation with seamless support across architectures (x86_64, ARM64, etc.)

## Table of Contents

1. [Overview](#overview)
2. [System Architecture](#system-architecture)
3. [Component Design](#component-design)
   1. [Core System](#core-system)
      1. [Memory Management System](#memory-management-system)
      2. [Tensor Implementation](#tensor-implementation)
      3. [Error Handling Framework](#error-handling-framework)
      4. [Concurrency Model](#concurrency-model)
   2. [Model Architecture](#model-architecture)
      1. [Transformer Core](#transformer-core)
      2. [Attention Mechanisms](#attention-mechanisms)
      3. [Mixture of Experts (MoE)](#mixture-of-experts-moe)
   3. [Computation Backend](#computation-backend)
      1. [Backend Interface](#backend-interface)
      2. [Metal Integration for Apple Silicon](#metal-integration-for-apple-silicon)
   4. [Inference Pipeline](#inference-pipeline)
      1. [Model Loading](#model-loading)
      2. [Generation Strategies](#generation-strategies)
   5. [Optimization Layer](#optimization-layer)
      1. [Compile-Time Optimizations](#compile-time-optimizations)
      2. [Quantization Framework](#quantization-framework)
4. [Platform-Specific Optimizations](#platform-specific-optimizations)
5. [Development Roadmap](#development-roadmap)
6. [Why Propose DeepSeek V3 in Zig?](#why-propose-deepseek-v3-in-zig)
## System Architecture

### High-Level Component Overview

The DeepSeek V3 Zig implementation consists of the following major components:

```
DeepSeek V3 Zig
│
├── Core
│   ├── Memory Management System
│   │   ├── Custom Allocator Framework
│   │   ├── Arena Allocation Strategy
│   │   └── Memory Pool Implementation
│   ├── Tensor Implementation
│   │   ├── SIMD-Optimized Operations
│   │   ├── Compile-Time Specialization
│   │   └── Zero-Cost Abstractions
│   └── Error Handling Framework
│       ├── Comprehensive Error Types
│       └── Performance-Optimized Error Paths
│
├── Model Architecture
│   ├── Transformer Layers
│   │   ├── Comptime-Generated Layer Variants
│   │   └── Optimized Forward Pass
│   ├── Attention Mechanisms
│   │   ├── Vectorized Multi-Head Attention
│   │   └── Efficient KV-Cache Management
│   ├── MoE (Mixture of Experts)
│   │   ├── Parallel Expert Execution
│   │   └── Optimized Router Implementation
│   └── Embedding Systems
│       ├── Memory-Efficient Token Embeddings
│       └── Positional Encoding Optimizations
│
├── Computation Backend
│   ├── CPU Implementation
│   │   ├── SIMD Vectorization
│   │   └── Multi-Threaded Execution
│   ├── GPU Integration (Optional)
│   │   ├── CUDA Support (NVIDIA)
│   │   ├── Metal Support (Apple)
│   │   └── ROCm Support (AMD)
│   └── Backend Interface Layer
│       ├── Zero-Cost Abstraction
│       └── Compile-Time Dispatch
│
├── Inference Pipeline
│   ├── Model Loading & Weight Management
│   ├── Tokenization System
│   ├── Advanced Generation Strategies
│   │   ├── Speculative Decoding
│   │   └── Beam Search
│   └── Streaming Output Processing
│
└── Optimization Layer
    ├── Compile-Time Specialization
    │   ├── Architecture-Specific Code Gen
    │   └── Tensor Operation Optimization
    ├── Runtime Performance Tuning
    │   ├── Cache-Aware Memory Layout
    │   └── Workload Balancing
    └── Quantization Framework
        ├── Mixed-Precision Support
        └── Hardware-Accelerated Execution
```

## Detailed Component Design

### 1. Core Systems

#### 1.1 Memory Management System

Zig's explicit memory management is a significant advantage over Python's garbage collection for an inference workload.
Zig provides explicit allocator interfaces that give fine-grained control over memory allocation and deallocation strategies:

```zig
const std = @import("std");

// Define a custom tensor allocator that combines multiple strategies
pub const TensorAllocator = struct {
    // Use arena for temporary tensor operations during inference
    arena: std.heap.ArenaAllocator,

    // Use a fixed buffer for small activations
    fixed_buffer: [1024 * 1024]u8 = undefined,
    fixed_allocator: std.heap.FixedBufferAllocator,

    // General purpose allocator for long-lived objects
    gpa: std.heap.GeneralPurposeAllocator(.{}),

    pub fn init(backing_allocator: std.mem.Allocator) !*TensorAllocator {
        var self = try backing_allocator.create(TensorAllocator);
        self.* = .{
            .arena = std.heap.ArenaAllocator.init(backing_allocator),
            .fixed_allocator = std.heap.FixedBufferAllocator.init(&self.fixed_buffer),
            .gpa = std.heap.GeneralPurposeAllocator(.{}){},
        };
        return self;
    }

    pub fn deinit(self: *TensorAllocator) void {
        self.arena.deinit();
        _ = self.gpa.deinit();
        // backing allocator will free self
    }

    // Get the right allocator for specific tensor use cases
    pub fn temporaryAllocator(self: *TensorAllocator) std.mem.Allocator {
        return self.arena.allocator();
    }

    pub fn smallActivationAllocator(self: *TensorAllocator) std.mem.Allocator {
        return self.fixed_allocator.allocator();
    }

    pub fn persistentAllocator(self: *TensorAllocator) std.mem.Allocator {
        return self.gpa.allocator();
    }
};

// Inference function example with specialized memory allocation
pub fn performInference(model: *Model, input: Tensor) !Tensor {
    var allocator = try TensorAllocator.init(std.heap.page_allocator);
    defer allocator.deinit();

    // Use different allocators for different tensor operations
    var activations = try computeActivations(model, input, allocator.temporaryAllocator());
    var weights = try loadModelWeights(model, allocator.persistentAllocator());

    // Results are automatically freed when the arena is deinitialized
    return try generateOutput(activations, weights, allocator.temporaryAllocator());
}
```

**Key Features:**

- **Tiered Allocation Strategy**: Different allocators for different memory usage patterns
- **Arena Allocation**: Bulk allocation and freeing for intermediate tensors, dramatically reducing memory management overhead
- **Fixed Buffer Allocation**: Zero-heap-allocation path for small, predictable tensor operations
- **Memory Pool Implementation**: Custom pools for tensor data to minimize fragmentation
- **Explicit Error Handling**: All allocation failures are explicitly handled with Zig's error system

#### 1.2 Tensor Implementation

Tensors are the fundamental data structure for DeepSeek.
Our implementation leverages Zig's advanced compile-time features, SIMD capabilities, and memory layout optimizations for maximum performance: ```zig pub fn Tensor(comptime DataType: type, comptime dimensions: usize) type { return struct { const Self = @This(); data: []DataType, shape: [dimensions]usize, strides: [dimensions]usize, allocator: std.mem.Allocator, is_contiguous: bool, // Vector types for SIMD operations based on hardware capabilities pub const VecType = switch (DataType) { f32 => if (@hasDecl(builtin, "cpu") and @hasDecl(builtin.cpu, "avx")) @Vector(8, f32) // AVX else if (@hasDecl(builtin, "cpu") and @hasDecl(builtin.cpu, "sse")) @Vector(4, f32) // SSE else @Vector(4, f32), // Fallback f16 => @Vector(8, f16), i32 => @Vector(8, i32), i8 => @Vector(16, i8), else => @compileError("Unsupported data type for SIMD"), }; // Number of elements in the SIMD vector pub const vec_width = @sizeOf(VecType) / @sizeOf(DataType); pub fn init(allocator: std.mem.Allocator, shape: [dimensions]usize) !Self { var strides: [dimensions]usize = undefined; var total_size: usize = 1; // Calculate C-contiguous (row-major) strides for optimal memory access var i: usize = dimensions; while (i > 0) { i -= 1; strides[i] = total_size; total_size *= shape[i]; } // Align memory for optimal SIMD access const alignment = @alignOf(VecType); const data = try allocator.alignedAlloc(DataType, alignment, total_size); return Self{ .data = data, .shape = shape, .strides = strides, .allocator = allocator, .is_contiguous = true, }; } pub fn deinit(self: *Self) void { self.allocator.free(self.data); } // Optimized SIMD matrix multiplication for 2D tensors pub fn matmul(self: *Self, other: *Self, allocator: std.mem.Allocator) !Self { std.debug.assert(dimensions == 2 and other.dimensions == 2); std.debug.assert(self.shape[1] == other.shape[0]); const M = self.shape[0]; const K = self.shape[1]; const N = other.shape[1]; var result = try Self.init(allocator, .{ M, N }); // Zero initialization @memset(result.data, 0); // Check if both tensors are contiguous for optimal performance if (self.is_contiguous and other.is_contiguous) { // Cache-aware blocked matrix multiplication with SIMD const block_size = 64; // Tuned for L1 cache // For each block var i: usize = 0; while (i < M) : (i += block_size) { const i_end = @min(i + block_size, M); var j: usize = 0; while (j < N) : (j += block_size) { const j_end = @min(j + block_size, N); var k: usize = 0; while (k < K) : (k += block_size) { const k_end = @min(k + block_size, K); // Process each block var ii: usize = i; while (ii < i_end) : (ii += 1) { var jj: usize = j; while (jj < j_end) : (jj += vec_width) { // SIMD-optimized inner loop if (jj + vec_width <= j_end) { var sum: VecType = @splat(0); var kk: usize = k; while (kk < k_end) : (kk += 1) { const a_val = self.data[ii * K + kk]; const b_vec: VecType = blk: { var tmp: [vec_width]DataType = undefined; for (0..vec_width) |v| { if (jj + v < j_end) { tmp[v] = other.data[kk * N + (jj + v)]; } else { tmp[v] = 0; } } break :blk tmp; }; sum += @splat(a_val) * b_vec; } // Store result for (0..vec_width) |v| { if (jj + v < j_end) { result.data[ii * N + (jj + v)] += sum[v]; } } } else { // Handle remaining columns (tail) while (jj < j_end) : (jj += 1) { var sum: DataType = 0; var kk: usize = k; while (kk < k_end) : (kk += 1) { sum += self.data[ii * K + kk] * other.data[kk * N + jj]; } result.data[ii * N + jj] += sum; } } } } } } } } else { // Fallback for non-contiguous tensors var i: usize = 0; while (i < M) : (i += 1) { var j: usize = 0; 
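// Fallback path: naive triple loop over (i, j, k) using strided at()/set() element access;
// correct for any memory layout, but considerably slower than the blocked SIMD path above.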
while (j < N) : (j += 1) { var sum: DataType = 0; var k: usize = 0; while (k < K) : (k += 1) { sum += self.at(.{i, k}) * other.at(.{k, j}); } try result.set(.{i, j}, sum); } } } return result; } // Access element at specific indices pub fn at(self: Self, indices: [dimensions]usize) DataType { var offset: usize = 0; inline for (0..dimensions) |i| { offset += indices[i] * self.strides[i]; } return self.data[offset]; } // Set element at specific indices pub fn set(self: *Self, indices: [dimensions]usize, value: DataType) !void { var offset: usize = 0; inline for (0..dimensions) |i| { offset += indices[i] * self.strides[i]; } self.data[offset] = value; } // Apply element-wise operations with SIMD acceleration pub fn map(self: Self, comptime op: fn (DataType) DataType, allocator: std.mem.Allocator) !Self { var result = try Self.init(allocator, self.shape); // Use SIMD operations for contiguous data if (self.is_contiguous) { var i: usize = 0; const vec_chunks = self.data.len / vec_width; // Process in SIMD chunks while (i < vec_chunks) : (i += 1) { const base_idx = i * vec_width; var vec: VecType = undefined; // Load vector for (0..vec_width) |j| { vec[j] = self.data[base_idx + j]; } // Apply operation on each vector element for (0..vec_width) |j| { vec[j] = op(vec[j]); } // Store result for (0..vec_width) |j| { result.data[base_idx + j] = vec[j]; } } // Process remaining elements const remaining_start = vec_chunks * vec_width; for (remaining_start..self.data.len) |j| { result.data[j] = op(self.data[j]); } } else { // Fallback for non-contiguous data var indices: [dimensions]usize = .{0} ** dimensions; var done = false; while (!done) { const val = self.at(indices); try result.set(indices, op(val)); // Increment indices var d = dimensions - 1; while (true) { indices[d] += 1; if (indices[d] < self.shape[d]) break; indices[d] = 0; if (d == 0) { done = true; break; } d -= 1; } } } return result; } }; } // Specialized tensor types for common uses const FloatTensor1D = Tensor(f32, 1); const FloatTensor2D = Tensor(f32, 2); const FloatTensor4D = Tensor(f32, 4); // Common for batch x height x width x channels const QuantizedTensor4D = Tensor(i8, 4); // For quantized operations ``` **Key Features:** - **Hardware-Aware SIMD Vectorization**: Automatically selects optimal vector width based on CPU capabilities (AVX, SSE) - **Cache-Optimized Algorithms**: Blocked matrix multiplication designed for L1/L2 cache efficiency - **Aligned Memory Allocation**: Ensures data is properly aligned for SIMD operations - **Specialized Tensor Types**: Pre-defined tensor configurations for common use cases - **Automatic Fallbacks**: Graceful degradation for non-contiguous tensors or unsupported operations - **Compile-Time Optimization**: Tensor dimensions and data types resolved at compile time for maximum performance - **Zero-Runtime Overhead**: SIMD operations with no dynamic dispatch or virtual function calls #### 1.3 Error Handling Framework Zig's error handling system provides a powerful foundation for creating robust, high-performance software. 
Unlike exceptions in languages like C++ or Python, Zig's error handling is explicit and deterministic, making it particularly well-suited for large-scale machine learning applications:

```zig
// Define a comprehensive set of potential errors with clear semantic meaning
const ModelError = error{
    ModelLoadFailed,
    InvalidDimension,
    InvalidShape,
    OutOfMemory,
    ComputeBackendError,
    InvalidWeight,
    UnsupportedOperation,
    UnsupportedDataType,
    DeviceNotAvailable,
    TensorShapeMismatch,
    QuantizationError,
    InvalidConfiguration,
};

// Union error sets for comprehensive error handling
const DeepSeekError = ModelError || TensorError || AllocationError || IoError;

// Example function demonstrating Zig's error handling with defer for cleanup
fn loadModel(allocator: std.mem.Allocator, path: []const u8) DeepSeekError!*Model {
    var file = try std.fs.cwd().openFile(path, .{});
    defer file.close(); // Ensures file is closed even if an error occurs

    var buffer = std.ArrayList(u8).init(allocator);
    defer buffer.deinit(); // Clean up buffer regardless of success/failure

    const file_size = file.getEndPos() catch return ModelError.ModelLoadFailed;
    try buffer.resize(file_size); // Set the length so readAll has space to fill

    const bytes_read = try file.readAll(buffer.items);
    if (bytes_read == 0) return ModelError.ModelLoadFailed;

    var model = try allocator.create(Model);
    errdefer allocator.destroy(model); // Only called if an error occurs after this point

    model.* = Model.init(allocator);
    errdefer model.deinit(); // Only called if an error occurs after this point

    // Parse weights and initialize model...
    if (!try parseWeights(model, buffer.items)) {
        return ModelError.InvalidWeight;
    }

    return model;
}

// Demonstrate error handling in caller code
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Handle errors explicitly with catch and a switch on the error value
    const model = loadModel(allocator, "model.bin") catch |err| {
        switch (err) {
            ModelError.ModelLoadFailed => {
                std.debug.print("Failed to load model file\n", .{});
                return err;
            },
            ModelError.InvalidWeight => {
                std.debug.print("Model contains invalid weights\n", .{});
                return err;
            },
            else => {
                std.debug.print("Unexpected error: {}\n", .{err});
                return err;
            },
        }
    };
    defer model.deinit();

    // Continue with model usage...
}
```

**Key Features:**

- **Explicit Error Types**: Clearly defined error sets that precisely describe what can go wrong
- **No Exceptions**: Deterministic error handling with no hidden control flow
- **Resource Safety**: Automatic cleanup with `defer` and `errdefer` ensures resources are properly managed
- **Performance Optimization**: Error handling doesn't rely on stack unwinding or dynamic dispatch
- **Composable Error Sets**: Error types can be combined using the `||` operator
- **Selective Handling**: `catch` combined with `switch` handles individual error cases when needed
- **Error Tracing**: Built-in error return trace capability for debugging

#### 1.4 Concurrency Model

Zig's concurrency model will be leveraged to parallelize computation-intensive operations in DeepSeek.
Zig's async/await syntax provides a structured approach to concurrency without the overhead of traditional threading: ```zig const std = @import("std"); // Thread pool for CPU-bound parallel tasks pub const ComputeThreadPool = struct { pool: std.Thread.Pool, completion_count: std.atomic.Atomic(usize), pub fn init(thread_count: usize) !ComputeThreadPool { var pool: std.Thread.Pool = undefined; try pool.init(.{ .allocator = std.heap.c_allocator, .n_jobs = thread_count, }); return ComputeThreadPool{ .pool = pool, .completion_count = std.atomic.Atomic(usize).init(0), }; } pub fn deinit(self: *ComputeThreadPool) void { self.pool.deinit(); } // Execute a compute task asynchronously pub fn compute(self: *ComputeThreadPool, task: *const fn(*anyopaque) void, context: *anyopaque) !void { try self.pool.spawn(task, context); } // Wait for all compute tasks to complete pub fn waitAll(self: *ComputeThreadPool) void { // Process tasks in the event loop until all are complete while (self.completion_count.load(.Acquire) > 0) { std.time.sleep(1 * std.time.millisecond); } } }; // Parallel tensor operation example with async/await pub fn parallelMatMul(allocator: std.mem.Allocator, a: *Tensor(f32, 2), b: *Tensor(f32, 2)) !*Tensor(f32, 2) { const M = a.shape[0]; const K = a.shape[1]; const N = b.shape[1]; var result = try Tensor(f32, 2).init(allocator, .{M, N}); errdefer result.deinit(); @memset(result.data, 0); // Create thread pool with optimal number of threads const cpu_count = try std.Thread.getCpuCount(); var thread_pool = try ComputeThreadPool.init(cpu_count); defer thread_pool.deinit(); // Split work based on number of available cores const rows_per_thread = (M + cpu_count - 1) / cpu_count; // Define the worker task const WorkContext = struct { a: *const Tensor(f32, 2), b: *const Tensor(f32, 2), result: *Tensor(f32, 2), start_row: usize, end_row: usize, thread_pool: *ComputeThreadPool, }; // Worker function for computing a subset of rows const workerFn = struct { fn compute(context_ptr: *anyopaque) void { const context = @ptrCast(*WorkContext, @alignCast(@alignOf(WorkContext), context_ptr)); const a = context.a; const b = context.b; const result = context.result; const start_row = context.start_row; const end_row = context.end_row; // Compute assigned rows for (start_row..end_row) |i| { if (i >= a.shape[0]) break; for (0..b.shape[1]) |j| { var sum: f32 = 0.0; for (0..a.shape[1]) |k| { sum += a.at(.{i, k}) * b.at(.{k, j}); } result.set(.{i, j}, sum) catch {}; } } // Mark task as complete _ = context.thread_pool.completion_count.fetchSub(1, .Release); } }; // Spawn workers for each section of the matrix for (0..cpu_count) |i| { const start_row = i * rows_per_thread; const end_row = std.math.min(start_row + rows_per_thread, M); if (start_row >= M) break; // Create context for this worker var context = try allocator.create(WorkContext); context.* = .{ .a = a, .b = b, .result = result, .start_row = start_row, .end_row = end_row, .thread_pool = &thread_pool, }; // Increment completion counter before spawning task _ = thread_pool.completion_count.fetchAdd(1, .Release); // Spawn the worker task try thread_pool.compute(workerFn.compute, context); } // Wait for all tasks to complete thread_pool.waitAll(); return result; } ``` **Key Features:** - **Thread Pool Management**: Efficient worker thread allocation based on available CPU cores - **Work Partitioning**: Automatic division of work across available cores - **Minimal Synchronization**: Lock-free atomic counters for synchronization when needed - **Resource 
Safety**: Proper cleanup with `defer` and `errdefer` even during concurrent execution - **Structured Concurrency**: Clear task dependencies and lifecycle management - **Zero Runtime Overhead**: No garbage collection or runtime dependencies ### 2. Model Architecture #### 2.1 Transformer Core The transformer architecture is the foundation of DeepSeek V3. Our Zig implementation will leverage compile-time metaprogramming and advanced memory optimizations for maximum performance: ```zig const std = @import("std"); // Precomputed type variants for different data precisions pub const DataType = enum { f32, // 32-bit floating point (for debugging/development) bf16, // BFloat16 (for training/default inference) f16, // Float16 (for hardware with native f16 support) i8, // 8-bit integer (for quantized inference) i4, // 4-bit integer (for extreme quantization) }; // Configuration struct with default values matching DeepSeek V3 pub const ModelArgs = struct { // Core model parameters max_batch_size: usize = 8, max_seq_len: usize = 4096 * 4, data_type: DataType = .bf16, vocab_size: usize = 102400, dim: usize = 2048, inter_dim: usize = 10944, moe_inter_dim: usize = 1408, n_layers: usize = 27, n_dense_layers: usize = 1, n_heads: usize = 16, // MoE configuration n_routed_experts: usize = 64, n_shared_experts: usize = 2, n_activated_experts: usize = 6, n_expert_groups: usize = 1, n_limited_groups: usize = 1, score_func: enum { softmax, sigmoid } = .softmax, route_scale: f32 = 1.0, // MLA configuration q_lora_rank: usize = 0, kv_lora_rank: usize = 512, qk_nope_head_dim: usize = 128, qk_rope_head_dim: usize = 64, v_head_dim: usize = 128, // Positional encoding original_seq_len: usize = 4096, rope_theta: f32 = 10000.0, rope_factor: f32 = 40, beta_fast: usize = 32, beta_slow: usize = 1, mscale: f32 = 1.0, // Runtime options use_flash_attention: bool = true, // Use optimized attention implementation use_parallel_experts: bool = true, // Run experts in parallel max_token_limit: ?usize = null, // Optional token generation limit // Generate optimized implementations based on config parameters pub fn getModelType(self: @This()) type { return struct { const ModelType = @This(); const config = self; // Select optimal types based on data_type pub const StorageType = switch (config.data_type) { .f32 => f32, .bf16 => std.packed_bf16, .f16 => f16, .i8 => i8, .i4 => i4, }; // Define tensor types for different dimensions pub const WeightTensor = Tensor(StorageType, 2); pub const ActivationTensor = Tensor(f32, 3); // Always use f32 for activations pub const EmbeddingTensor = Tensor(StorageType, 2); pub const KVCacheTensor = Tensor(f32, 4); // [batch, seq_len, heads, dim] // Generate layer configuration pub const layer_config = struct { pub const head_dim = (config.dim / config.n_heads); pub const moe_layers_start = config.n_dense_layers; }; }; } }; // Main transformer model implementation pub fn TransformerModel(comptime args: ModelArgs) type { // Use comptime to generate a specialized model implementation based on args return struct { const Self = @This(); const ModelType = args.getModelType(); // Model components allocator: std.mem.Allocator, embedding: Embedding(args), layers: []TransformerBlock(args), norm: RMSNorm(args.dim), head: Linear(args.dim, args.vocab_size), freqs_cis: Tensor(f32, 3), // [max_seq_len, 2, qk_rope_head_dim] // KV cache for optimized inference kv_cache: ?ModelType.KVCacheTensor, pub fn init(allocator: std.mem.Allocator) !Self { // Initialize components var embedding = try 
Embedding(args).init(allocator); errdefer embedding.deinit(); var layers = try allocator.alloc(TransformerBlock(args), args.n_layers); errdefer allocator.free(layers); // Create layers with appropriate configurations for (layers, 0..) |*layer, i| { const is_moe = i >= args.n_dense_layers; layer.* = try TransformerBlock(args).init(allocator, i, is_moe); } var norm = try RMSNorm(args.dim).init(allocator); errdefer norm.deinit(); var head = try Linear(args.dim, args.vocab_size).init(allocator, false); errdefer head.deinit(); // Precompute positional encoding frequencies var freqs_cis = try precomputeFreqsCis(allocator, args); return Self{ .allocator = allocator, .embedding = embedding, .layers = layers, .norm = norm, .head = head, .freqs_cis = freqs_cis, .kv_cache = null, }; } pub fn deinit(self: *Self) void { self.embedding.deinit(); for (self.layers) |*layer| { layer.deinit(); } self.allocator.free(self.layers); self.norm.deinit(); self.head.deinit(); self.freqs_cis.deinit(); if (self.kv_cache) |*cache| { cache.deinit(); } } // Initialize KV cache for efficient inference pub fn initKVCache(self: *Self) !void { if (self.kv_cache != null) return; const batch_size = args.max_batch_size; const seq_len = args.max_seq_len; const n_heads = args.n_heads; const head_dim = ModelType.layer_config.head_dim; self.kv_cache = try ModelType.KVCacheTensor.init( self.allocator, .{batch_size, seq_len, n_heads, head_dim * 2} ); // Zero-initialize cache @memset(self.kv_cache.?.data, 0); } // Forward pass through the transformer model pub fn forward(self: *Self, token_ids: []const usize, start_pos: usize) !Tensor(f32, 2) { const batch_size = 1; // Currently supporting batch_size=1 for inference const seq_len = token_ids.len; // Create tensor from token_ids var input_tensor = try ModelType.ActivationTensor.init( self.allocator, .{batch_size, seq_len, args.dim} ); defer input_tensor.deinit(); // Get embeddings for input tokens try self.embedding.embed(token_ids, &input_tensor); // Process through each transformer layer var x = input_tensor; const freqs_cis_slice = try self.freqs_cis.slice(.{start_pos, 0, 0}, .{start_pos + seq_len, 2, args.qk_rope_head_dim}); // Create attention mask for causal attention var mask: ?Tensor(f32, 2) = null; if (seq_len > 1) { mask = try createCausalMask(self.allocator, seq_len); defer if (mask) |*m| m.deinit(); } // Process through transformer layers for (self.layers) |*layer| { x = try layer.forward(x, start_pos, freqs_cis_slice, mask); } // Apply final normalization var normalized = try self.norm.forward(x); defer normalized.deinit(); // Extract last token for prediction var last_token = try normalized.slice( .{0, seq_len - 1, 0}, .{batch_size, seq_len, args.dim} ); defer last_token.deinit(); // Project to vocabulary return try self.head.forward(last_token); } // Helper to create causal attention mask fn createCausalMask(allocator: std.mem.Allocator, seq_len: usize) !Tensor(f32, 2) { var mask = try Tensor(f32, 2).init(allocator, .{seq_len, seq_len}); errdefer mask.deinit(); for (0..seq_len) |i| { for (0..seq_len) |j| { const value: f32 = if (j <= i) 0.0 else -10000.0; try mask.set(.{i, j}, value); } } return mask; } }; } // Generate specialized transformer based on configuration pub fn createTransformer(allocator: std.mem.Allocator, args: ModelArgs) !*TransformerModel(args) { var model = try allocator.create(TransformerModel(args)); errdefer allocator.destroy(model); model.* = try TransformerModel(args).init(allocator); return model; } ``` This implementation leverages Zig's 
compile-time features to generate specialized model implementations based on the provided configuration parameters. The use of generic types and comptime evaluation allows for maximum performance optimization while maintaining code flexibility. #### 2.2 Attention Mechanism The Multi-Head Latent Attention (MLA) mechanism is a critical component of DeepSeek V3's performance. Our Zig implementation leverages compile-time specialization, SIMD vectorization, and cache-friendly algorithms for maximum efficiency: ```zig // Generic MLA implementation with compile-time specialization pub fn MLA(comptime args: ModelArgs) type { return struct { const Self = @This(); const ModelType = args.getModelType(); // Attention configuration dim: usize, n_heads: usize, head_dim: usize, q_lora_rank: usize, kv_lora_rank: usize, qk_nope_head_dim: usize, qk_rope_head_dim: usize, qk_head_dim: usize, v_head_dim: usize, softmax_scale: f32, use_flash_attention: bool, // Projection matrices allocator: std.mem.Allocator, wq: ?ColumnParallelLinear(args) = null, // Regular query projection wq_a: ?Linear(args.dim, args.q_lora_rank) = null, // LoRA decomposition q_norm: ?RMSNorm(args.q_lora_rank) = null, // LoRA normalization wq_b: ?ColumnParallelLinear(args) = null, // LoRA decomposition wkv_a: Linear(args.dim, args.kv_lora_rank + args.qk_rope_head_dim), kv_norm: RMSNorm(args.kv_lora_rank), wkv_b: ColumnParallelLinear(args), wo: RowParallelLinear(args), // KV caching - optimized for memory access patterns kv_cache: ?Tensor(f32, 4) = null, // [batch, seq_len, heads, head_dim*2] rope_cache: ?Tensor(f32, 3) = null, // [batch, seq_len, rope_dim] // Initialize MLA with appropriate configuration pub fn init(allocator: std.mem.Allocator) !Self { const head_dim = args.dim / args.n_heads; var softmax_scale = 1.0 / std.math.sqrt(@as(f32, @floatFromInt(args.qk_nope_head_dim + args.qk_rope_head_dim))); // Apply scaling for extended context if needed if (args.max_seq_len > args.original_seq_len) { const mscale = 0.1 * args.mscale * std.math.log(args.rope_factor) + 1.0; softmax_scale *= mscale * mscale; } // Initialize query projection (either direct or with LoRA) var wq: ?ColumnParallelLinear(args) = null; var wq_a: ?Linear(args.dim, args.q_lora_rank) = null; var q_norm: ?RMSNorm(args.q_lora_rank) = null; var wq_b: ?ColumnParallelLinear(args) = null; if (args.q_lora_rank == 0) { // Standard query projection wq = try ColumnParallelLinear(args).init( allocator, args.dim, args.n_heads * (args.qk_nope_head_dim + args.qk_rope_head_dim), false ); } else { // Low-rank adaptation for query wq_a = try Linear(args.dim, args.q_lora_rank).init(allocator, false); q_norm = try RMSNorm(args.q_lora_rank).init(allocator); wq_b = try ColumnParallelLinear(args).init( allocator, args.q_lora_rank, args.n_heads * (args.qk_nope_head_dim + args.qk_rope_head_dim), false ); } // Key-value projections var wkv_a = try Linear(args.dim, args.kv_lora_rank + args.qk_rope_head_dim).init(allocator, false); var kv_norm = try RMSNorm(args.kv_lora_rank).init(allocator); var wkv_b = try ColumnParallelLinear(args).init( allocator, args.kv_lora_rank, args.n_heads * (args.qk_nope_head_dim + args.v_head_dim), false ); // Output projection var wo = try RowParallelLinear(args).init( allocator, args.n_heads * args.v_head_dim, args.dim, false ); return Self{ .allocator = allocator, .dim = args.dim, .n_heads = args.n_heads, .head_dim = head_dim, .q_lora_rank = args.q_lora_rank, .kv_lora_rank = args.kv_lora_rank, .qk_nope_head_dim = args.qk_nope_head_dim, .qk_rope_head_dim = 
args.qk_rope_head_dim, .qk_head_dim = args.qk_nope_head_dim + args.qk_rope_head_dim, .v_head_dim = args.v_head_dim, .softmax_scale = softmax_scale, .use_flash_attention = args.use_flash_attention, .wq = wq, .wq_a = wq_a, .q_norm = q_norm, .wq_b = wq_b, .wkv_a = wkv_a, .kv_norm = kv_norm, .wkv_b = wkv_b, .wo = wo, }; } pub fn deinit(self: *Self) void { if (self.wq) |*w| w.deinit(); if (self.wq_a) |*w| w.deinit(); if (self.q_norm) |*n| n.deinit(); if (self.wq_b) |*w| w.deinit(); self.wkv_a.deinit(); self.kv_norm.deinit(); self.wkv_b.deinit(); self.wo.deinit(); if (self.kv_cache) |*cache| cache.deinit(); if (self.rope_cache) |*cache| cache.deinit(); } // Initialize KV cache for efficient inference pub fn initKVCache(self: *Self, batch_size: usize, seq_len: usize) !void { if (self.kv_cache != null) return; // Allocate KV cache self.kv_cache = try Tensor(f32, 4).init( self.allocator, .{batch_size, seq_len, self.n_heads, self.head_dim * 2} ); // Zero-initialize @memset(self.kv_cache.?.data, 0); // Allocate rotary positional encoding cache self.rope_cache = try Tensor(f32, 3).init( self.allocator, .{batch_size, seq_len, self.qk_rope_head_dim} ); @memset(self.rope_cache.?.data, 0); } // Forward pass implementation with multiple specialized paths pub fn forward( self: *Self, x: Tensor(f32, 3), start_pos: usize, freqs_cis: Tensor(f32, 3), mask: ?Tensor(f32, 2) ) !Tensor(f32, 3) { const batch_size = x.shape[0]; const seq_len = x.shape[1]; const end_pos = start_pos + seq_len; // Initialize KV cache if not already done if (start_pos > 0 and self.kv_cache == null) { try self.initKVCache(batch_size, args.max_seq_len); } // Compute query vectors var q: Tensor(f32, 4) = undefined; if (self.q_lora_rank == 0) { // Standard query projection var q_flat = try self.wq.?.forward(x); defer q_flat.deinit(); // Reshape to [batch, seq_len, heads, head_dim] q = try q_flat.reshape(.{batch_size, seq_len, self.n_heads, self.qk_head_dim}); } else { // Low-rank adaptation var q_a = try self.wq_a.?.forward(x); defer q_a.deinit(); var q_norm = try self.q_norm.?.forward(q_a); defer q_norm.deinit(); var q_b = try self.wq_b.?.forward(q_norm); defer q_b.deinit(); // Reshape q = try q_b.reshape(.{batch_size, seq_len, self.n_heads, self.qk_head_dim}); } defer q.deinit(); // Split query into regular and positional parts var q_slices = try q.split(3, .{self.qk_nope_head_dim, self.qk_rope_head_dim}); defer for (q_slices) |*slice| slice.deinit(); var q_nope = q_slices[0]; var q_pe = q_slices[1]; // Apply rotary embeddings to position-dependent part try applyRotaryEmbeddings(&q_pe, freqs_cis); // Compute key-value vectors var kv_raw = try self.wkv_a.forward(x); defer kv_raw.deinit(); // Split into KV features and positional features var kv_slices = try kv_raw.split(2, .{self.kv_lora_rank, self.qk_rope_head_dim}); defer for (kv_slices) |*slice| slice.deinit(); var kv_features = kv_slices[0]; var k_pe_features = kv_slices[1]; // Add batch and heads dimension to positional features var k_pe = try k_pe_features.reshape(.{batch_size, seq_len, 1, self.qk_rope_head_dim}); defer k_pe.deinit(); // Apply rotary embeddings try applyRotaryEmbeddings(&k_pe, freqs_cis); // Process main KV branch var kv_norm_features = try self.kv_norm.forward(kv_features); defer kv_norm_features.deinit(); var kv_proj = try self.wkv_b.forward(kv_norm_features); defer kv_proj.deinit(); // Reshape to separate K and V var kv_reshaped = try kv_proj.reshape( .{batch_size, seq_len, self.n_heads, self.qk_nope_head_dim + self.v_head_dim} ); defer kv_reshaped.deinit(); // 
Split into K and V var kv_parts = try kv_reshaped.split(3, .{self.qk_nope_head_dim, self.v_head_dim}); defer for (kv_parts) |*part| part.deinit(); var k_nope = kv_parts[0]; var v = kv_parts[1]; // Combine positional and non-positional key parts var k = try combineTensors(k_nope, k_pe, 3); defer k.deinit(); // Store in KV cache if available if (self.kv_cache != null) { try self.updateKVCache(k, v, start_pos, end_pos); } // Choose attention implementation based on settings var attention_output: Tensor(f32, 4) = undefined; if (self.use_flash_attention and seq_len > 1) { attention_output = try self.computeFlashAttention( q_nope, q_pe, self.kv_cache.?, self.rope_cache.?, mask, batch_size, seq_len, end_pos ); } else { attention_output = try self.computeStandardAttention( q, k, v, mask, batch_size, seq_len, end_pos ); } defer attention_output.deinit(); // Final projection var attention_flat = try attention_output.reshape( .{batch_size, seq_len, self.n_heads * self.v_head_dim} ); defer attention_flat.deinit(); return self.wo.forward(attention_flat); } // Flash attention implementation optimized for large contexts fn computeFlashAttention( self: *const Self, q_nope: Tensor(f32, 4), q_pe: Tensor(f32, 4), kv_cache: Tensor(f32, 4), rope_cache: Tensor(f32, 3), mask: ?Tensor(f32, 2), batch_size: usize, seq_len: usize, end_pos: usize ) !Tensor(f32, 4) { // Flash attention implementation with tiling to maximize cache efficiency // This function would include a highly optimized SIMD implementation // specializing in memory-efficient attention computation // Note: This would be a substantial implementation with memory-efficient // blocked matrix multiplication and careful SIMD optimization // We're providing a simplified structure here // For a full implementation, see the FlashAttention algorithm paper const block_size = 32; // Block size tuned for L1 cache // Output tensor var output = try Tensor(f32, 4).init( self.allocator, .{batch_size, seq_len, self.n_heads, self.v_head_dim} ); // Implement blocked attention algorithm... 
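    // Sketch of the tiled algorithm (assumption: it follows the standard FlashAttention
    // online-softmax recurrence; nothing below is implemented yet):
    //   for each block of `block_size` query rows:
    //     keep a running max, running softmax sum, and an output accumulator in registers/L1
    //     for each block of `block_size` key/value positions up to `end_pos`:
    //       1. load the K/V tile from `kv_cache` (and the rotary part from `rope_cache`)
    //       2. compute the partial score tile q_block @ k_block^T, scaled by `softmax_scale`
    //       3. apply any causal `mask` entries that fall inside this tile
    //       4. update the running max and rescale the previously accumulated output
    //       5. accumulate exp(scores - running_max) @ v_block into the output accumulator
    //   finally divide each output row by its running softmax sum and write it to `output`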
// This would contain optimized SIMD code for tiled attention computation return output; } // Standard attention for shorter sequences or when flash attention is disabled fn computeStandardAttention( self: *const Self, q: Tensor(f32, 4), k: Tensor(f32, 4), v: Tensor(f32, 4), mask: ?Tensor(f32, 2), batch_size: usize, seq_len: usize, end_pos: usize ) !Tensor(f32, 4) { // Compute QK attention scores var scores = try computeAttentionScores(q, k, self.softmax_scale); defer scores.deinit(); // Apply causal mask if provided if (mask) |m| { try applyAttentionMask(&scores, m); } // Apply softmax try applySoftmax(&scores, -1); // Compute attention output (scores @ v) return computeAttentionOutput(scores, v); } // Update KV cache with new values fn updateKVCache( self: *Self, k: Tensor(f32, 4), v: Tensor(f32, 4), start_pos: usize, end_pos: usize ) !void { const batch_size = k.shape[0]; const seq_len = k.shape[1]; // Update key cache for (0..batch_size) |b| { for (0..seq_len) |s| { const cache_pos = start_pos + s; for (0..self.n_heads) |h| { // Copy K values for (0..self.qk_head_dim) |d| { const k_val = try k.at(.{b, s, h, d}); try self.kv_cache.?.set(.{b, cache_pos, h, d}, k_val); } // Copy V values for (0..self.v_head_dim) |d| { const v_val = try v.at(.{b, s, h, d}); try self.kv_cache.?.set(.{b, cache_pos, h, self.qk_head_dim + d}, v_val); } } } } } }; } ``` **Key Optimizations:** - **Compile-Time Specialization**: Generated attention routines are tailored to model dimensions at compile time - **Flash Attention Algorithm**: Memory-efficient attention computation for long sequences - **SIMD-Optimized Matrix Operations**: Vectorized attention score calculation and softmax - **Optimized KV-Cache Layout**: Cache-friendly memory layout for efficient sequence generation - **Sparse Attention Patterns**: Support for different attention patterns beyond standard causal attention - **Memory Reuse**: Careful tensor management to minimize allocations during inference - **Specialized Attention Paths**: Different implementations optimized for inference vs. training - **Low-Rank Adaptation**: LoRA support for more efficient fine-tuning #### 2.3 Mixture of Experts (MoE) The Mixture of Experts (MoE) architecture is a key innovation in DeepSeek V3 that enables scaling model capacity without proportionally increasing computation cost. 
Our Zig implementation leverages compile-time specialization and parallel execution for maximum efficiency: ```zig // Generic MoE implementation with compile-time specialization pub fn MixtureOfExperts(comptime args: ModelArgs) type { return struct { const Self = @This(); const ModelType = args.getModelType(); // Configuration allocator: std.mem.Allocator, dim: usize, n_routed_experts: usize, n_local_experts: usize, n_activated_experts: usize, experts_start_idx: usize, experts_end_idx: usize, use_parallel_execution: bool, // Components gate: RouterGate(args), experts: []Expert(args), shared_experts: MLP(args), thread_pool: ?*ComputeThreadPool = null, // Initialize MoE with appropriate configuration pub fn init(allocator: std.mem.Allocator) !Self { // Determine expert distribution across processes const world_size = 1; // Set to actual world size for distributed training const rank = 0; // Set to actual rank for distributed training std.debug.assert(args.n_routed_experts % world_size == 0, "Number of experts must be divisible by world size"); const n_local_experts = args.n_routed_experts / world_size; const experts_start_idx = rank * n_local_experts; const experts_end_idx = experts_start_idx + n_local_experts; // Initialize routing gate var gate = try RouterGate(args).init(allocator); errdefer gate.deinit(); // Initialize experts var experts = try allocator.alloc(Expert(args), args.n_routed_experts); errdefer allocator.free(experts); // Only initialize experts that belong to this process for (experts, 0..) |*expert, i| { if (experts_start_idx <= i and i < experts_end_idx) { expert.* = try Expert(args).init(allocator); } else { expert.* = undefined; // Not used on this process } } // Initialize shared experts (always executed) var shared_experts = try MLP(args).init( allocator, args.dim, args.n_shared_experts * args.moe_inter_dim ); errdefer shared_experts.deinit(); // Initialize thread pool for parallel execution if needed var thread_pool: ?*ComputeThreadPool = null; if (args.use_parallel_experts) { thread_pool = try allocator.create(ComputeThreadPool); const cpu_count = try std.Thread.getCpuCount(); const optimal_threads = std.math.min( cpu_count, args.n_activated_experts + args.n_shared_experts ); thread_pool.?.* = try ComputeThreadPool.init(optimal_threads); } return Self{ .allocator = allocator, .dim = args.dim, .n_routed_experts = args.n_routed_experts, .n_local_experts = n_local_experts, .n_activated_experts = args.n_activated_experts, .experts_start_idx = experts_start_idx, .experts_end_idx = experts_end_idx, .use_parallel_execution = args.use_parallel_experts, .gate = gate, .experts = experts, .shared_experts = shared_experts, .thread_pool = thread_pool, }; } pub fn deinit(self: *Self) void { self.gate.deinit(); // Only deinit experts that belong to this process for (self.experts, 0..) 
|*expert, i| { if (self.experts_start_idx <= i and i < self.experts_end_idx) { expert.deinit(); } } self.allocator.free(self.experts); self.shared_experts.deinit(); if (self.thread_pool) |pool| { pool.deinit(); self.allocator.destroy(pool); } } // Forward pass implementation with parallel expert execution pub fn forward(self: *Self, x: Tensor(f32, 3)) !Tensor(f32, 3) { const batch_size = x.shape[0]; const seq_len = x.shape[1]; // Reshape input for routing var x_flat = try x.reshape(.{batch_size * seq_len, self.dim}); defer x_flat.deinit(); // Router computation var router_output = try self.gate.forward(x_flat); defer { router_output.weights.deinit(); router_output.indices.deinit(); } // Get routing weights and indices const weights = router_output.weights; const indices = router_output.indices; // Initialize result tensor with zeros var result = try Tensor(f32, 2).init( self.allocator, .{batch_size * seq_len, self.dim} ); errdefer result.deinit(); @memset(result.data, 0); // Count expert assignments for load balancing analysis var expert_counts = try self.allocator.alloc(usize, self.n_routed_experts); defer self.allocator.free(expert_counts); @memset(expert_counts, 0); for (indices.data) |idx| { expert_counts[idx] += 1; } // Process each expert if (self.use_parallel_execution and self.thread_pool != null) { try self.parallelExpertExecution( x_flat, weights, indices, expert_counts, &result ); } else { try self.sequentialExpertExecution( x_flat, weights, indices, expert_counts, &result ); } // Always execute shared experts var shared_output = try self.shared_experts.forward(x_flat); defer shared_output.deinit(); // Add shared expert output to result try addTensors(&result, shared_output); // Reshape back to original dimensions return result.reshape(.{batch_size, seq_len, self.dim}); } // Parallel execution of experts using thread pool fn parallelExpertExecution( self: *Self, x: Tensor(f32, 2), weights: Tensor(f32, 2), indices: Tensor(usize, 2), expert_counts: []usize, result: *Tensor(f32, 2) ) !void { const thread_pool = self.thread_pool.?; var work_queue = std.ArrayList(ExpertWorkItem).init(self.allocator); defer work_queue.deinit(); // Create work items for each expert for (0..self.n_routed_experts) |expert_idx| { if (expert_counts[expert_idx] == 0) continue; if (expert_idx < self.experts_start_idx or expert_idx >= self.experts_end_idx) { // Skip experts not assigned to this process continue; } // Extract tokens routed to this expert var token_indices = try self.allocator.alloc(usize, expert_counts[expert_idx]); var token_weights = try self.allocator.alloc(f32, expert_counts[expert_idx]); var token_count: usize = 0; for (0..x.shape[0]) |i| { for (0..self.n_activated_experts) |j| { const index_offset = i * self.n_activated_experts + j; if (indices.data[index_offset] == expert_idx) { token_indices[token_count] = i; token_weights[token_count] = weights.data[index_offset]; token_count += 1; } } } // Create work item try work_queue.append(.{ .allocator = self.allocator, .expert = &self.experts[expert_idx], .x = x, .token_indices = token_indices, .token_weights = token_weights, .result = result, .thread_pool = thread_pool, }); } // Schedule parallel expert execution for (work_queue.items) |*work_item| { // Increment completion counter _ = thread_pool.completion_count.fetchAdd(1, .Release); // Submit task to thread pool try thread_pool.compute(processExpertWork, work_item); } // Wait for all expert computations to complete thread_pool.waitAll(); } // Sequential execution of experts fn 
sequentialExpertExecution( self: *Self, x: Tensor(f32, 2), weights: Tensor(f32, 2), indices: Tensor(usize, 2), expert_counts: []usize, result: *Tensor(f32, 2) ) !void { // Process each expert sequentially for (0..self.n_routed_experts) |expert_idx| { if (expert_counts[expert_idx] == 0) continue; if (expert_idx < self.experts_start_idx or expert_idx >= self.experts_end_idx) { // Skip experts not assigned to this process continue; } // Get tokens assigned to this expert for (0..x.shape[0]) |i| { for (0..self.n_activated_experts) |j| { const index_offset = i * self.n_activated_experts + j; if (indices.data[index_offset] == expert_idx) { // Process token with this expert const token_weight = weights.data[index_offset]; // Extract input token var token_input = try x.slice(.{i, 0}, .{i + 1, self.dim}); defer token_input.deinit(); // Process through expert var expert_output = try self.experts[expert_idx].forward(token_input); defer expert_output.deinit(); // Scale by routing weight try scaleTensor(&expert_output, token_weight); // Add to result for (0..self.dim) |d| { result.data[i * self.dim + d] += expert_output.data[d]; } } } } } } // Worker task for parallel expert execution const ExpertWorkItem = struct { allocator: std.mem.Allocator, expert: *Expert(args), x: Tensor(f32, 2), token_indices: []usize, token_weights: []f32, result: *Tensor(f32, 2), thread_pool: *ComputeThreadPool, }; fn processExpertWork(ctx_ptr: *anyopaque) void { const ctx = @ptrCast(*ExpertWorkItem, @alignCast(@alignOf(ExpertWorkItem), ctx_ptr)); defer { ctx.allocator.free(ctx.token_indices); ctx.allocator.free(ctx.token_weights); _ = ctx.thread_pool.completion_count.fetchSub(1, .Release); } // Process each token assigned to this expert for (ctx.token_indices, ctx.token_weights, 0..) |token_idx, weight, i| { // Extract input token var token_input = ctx.x.slice(.{token_idx, 0}, .{token_idx + 1, ctx.x.shape[1]}) catch return; defer token_input.deinit(); // Process through expert var expert_output = ctx.expert.forward(token_input) catch return; defer expert_output.deinit(); // Scale by routing weight scaleTensor(&expert_output, weight) catch return; // Add to result (using atomic operations to avoid race conditions) for (0..expert_output.shape[1]) |d| { const offset = token_idx * expert_output.shape[1] + d; const old_val = @atomicLoad(f32, &ctx.result.data[offset], .Acquire); const new_val = old_val + expert_output.data[d]; @atomicStore(f32, &ctx.result.data[offset], new_val, .Release); } } } }; } // Router gate for MoE that determines which experts to use for each token pub fn RouterGate(comptime args: ModelArgs) type { return struct { const Self = @This(); allocator: std.mem.Allocator, dim: usize, n_experts: usize, n_groups: usize, n_limited_groups: usize, topk: usize, score_func: enum { softmax, sigmoid }, route_scale: f32, // Router weights weight: Tensor(f32, 2), bias: ?Tensor(f32, 1) = null, pub fn init(allocator: std.mem.Allocator) !Self { var weight = try Tensor(f32, 2).init( allocator, .{args.n_routed_experts, args.dim} ); // Initialize with appropriate distribution try initializeParameters(&weight, 0.0, 0.02); // Create optional bias var bias: ?Tensor(f32, 1) = null; if (args.dim == 7168) { // Special case for bias bias = try Tensor(f32, 1).init(allocator, .{args.n_routed_experts}); @memset(bias.?.data, 0); } return Self{ .allocator = allocator, .dim = args.dim, .n_experts = args.n_routed_experts, .n_groups = args.n_expert_groups, .n_limited_groups = args.n_limited_groups, .topk = args.n_activated_experts, .score_func 
= args.score_func, .route_scale = args.route_scale, .weight = weight, .bias = bias, }; } pub fn deinit(self: *Self) void { self.weight.deinit(); if (self.bias) |*b| b.deinit(); } // Router forward pass to determine expert assignment pub fn forward(self: *const Self, x: Tensor(f32, 2)) !RouterOutput { // Compute routing scores var scores = try linearProjection(x, self.weight, self.bias); defer scores.deinit(); // Apply scoring function var routing_probs: Tensor(f32, 2) = undefined; if (self.score_func == .softmax) { routing_probs = try applySoftmax(scores, 1); } else { routing_probs = try applySigmoid(scores); } defer routing_probs.deinit(); // Save original scores for later var original_scores = try routing_probs.clone(); // Expert group handling if (self.n_groups > 1) { try self.applyGroupFiltering(&routing_probs); } // Select top-k experts var indices = try Tensor(usize, 2).init( self.allocator, .{x.shape[0], self.topk} ); var weights = try Tensor(f32, 2).init( self.allocator, .{x.shape[0], self.topk} ); try self.selectTopkExperts(routing_probs, original_scores, &indices, &weights); // Apply routing scale if (self.route_scale != 1.0) { try scaleTensor(&weights, self.route_scale); } return RouterOutput{ .weights = weights, .indices = indices, }; } // Apply expert group filtering fn applyGroupFiltering(self: *const Self, scores: *Tensor(f32, 2)) !void { // Reshape scores for group processing const batch_size = scores.shape[0]; const experts_per_group = self.n_experts / self.n_groups; var reshaped_scores = try scores.reshape( .{batch_size, self.n_groups, experts_per_group} ); defer reshaped_scores.deinit(); // Compute group scores var group_scores = try Tensor(f32, 2).init( self.allocator, .{batch_size, self.n_groups} ); defer group_scores.deinit(); // Calculate score for each group if (self.bias == null) { // Use max score as group score for (0..batch_size) |b| { for (0..self.n_groups) |g| { var max_score: f32 = -std.math.inf_f32; for (0..experts_per_group) |e| { const score = try reshaped_scores.at(.{b, g, e}); if (score > max_score) max_score = score; } try group_scores.set(.{b, g}, max_score); } } } else { // Use sum of top-2 scores as group score for (0..batch_size) |b| { for (0..self.n_groups) |g| { var scores_arr = try self.allocator.alloc(f32, experts_per_group); defer self.allocator.free(scores_arr); // Extract scores for this group for (0..experts_per_group) |e| { scores_arr[e] = try reshaped_scores.at(.{b, g, e}); } // Sort to find top-2 std.sort.sort(f32, scores_arr, {}, std.sort.desc(f32)); // Sum top-2 scores const group_score = scores_arr[0] + scores_arr[1]; try group_scores.set(.{b, g}, group_score); } } } // Find top-k groups var top_groups = try Tensor(usize, 2).init( self.allocator, .{batch_size, self.n_limited_groups} ); defer top_groups.deinit(); // Select top-k groups for (0..batch_size) |b| { var scores_arr = try self.allocator.alloc(struct { score: f32, idx: usize }, self.n_groups); defer self.allocator.free(scores_arr); // Prepare for sorting for (0..self.n_groups) |g| { scores_arr[g] = .{ .score = try group_scores.at(.{b, g}), .idx = g, }; } // Sort by score const Sort = struct { fn desc(context: void, a: anytype, b: anytype) bool { return a.score > b.score; } }; std.sort.sort(struct { score: f32, idx: usize }, scores_arr, {}, Sort.desc); // Store top-k group indices for (0..self.n_limited_groups) |i| { try top_groups.set(.{b, i}, scores_arr[i].idx); } } // Create mask for filtering var mask = try Tensor(bool, 3).init( self.allocator, .{batch_size, self.n_groups, 
1} ); defer mask.deinit(); // Initialize all groups as masked (excluded) @memset(mask.data, true); // Unmask top groups for (0..batch_size) |b| { for (0..self.n_limited_groups) |i| { const g = try top_groups.at(.{b, i}); try mask.set(.{b, g, 0}, false); } } // Apply mask for (0..batch_size) |b| { for (0..self.n_groups) |g| { const is_masked = try mask.at(.{b, g, 0}); if (is_masked) { // Mask out this group by setting scores to -inf for (0..experts_per_group) |e| { try reshaped_scores.set(.{b, g, e}, -std.math.inf_f32); } } } } // Reshape back to original shape try scores.copyFrom(reshaped_scores.reshape(.{batch_size, self.n_experts}) catch unreachable); } // Select top-k experts based on routing scores fn selectTopkExperts( self: *const Self, scores: Tensor(f32, 2), original_scores: Tensor(f32, 2), indices: *Tensor(usize, 2), weights: *Tensor(f32, 2) ) !void { const batch_size = scores.shape[0]; for (0..batch_size) |b| { var scores_arr = try self.allocator.alloc(struct { score: f32, idx: usize }, self.n_experts); defer self.allocator.free(scores_arr); // Prepare for sorting for (0..self.n_experts) |e| { scores_arr[e] = .{ .score = try scores.at(.{b, e}), .idx = e, }; } // Sort by score const Sort = struct { fn desc(context: void, a: anytype, b: anytype) bool { return a.score > b.score; } }; std.sort.sort(struct { score: f32, idx: usize }, scores_arr, {}, Sort.desc); // Store top-k indices and get weights from original scores for (0..self.topk) |i| { const expert_idx = scores_arr[i].idx; try indices.set(.{b, i}, expert_idx); // Get weight from original scores const weight = try original_scores.at(.{b, expert_idx}); try weights.set(.{b, i}, weight); } // Normalize weights for sigmoid scoring if (self.score_func == .sigmoid) { var sum: f32 = 0.0; for (0..self.topk) |i| { sum += try weights.at(.{b, i}); } if (sum > 0.0) { for (0..self.topk) |i| { const w = try weights.at(.{b, i}); try weights.set(.{b, i}, w / sum); } } } } } }; } // Output from router gate pub const RouterOutput = struct { weights: Tensor(f32, 2), // [batch_size, topk] indices: Tensor(usize, 2), // [batch_size, topk] }; ``` **Key Features:** - **Compile-Time Specialization**: Generated MoE implementation tailored to model dimensions and configuration - **Parallel Expert Execution**: Efficient multi-threading with work distribution and load balancing - **Atomic Operations**: Thread-safe updates to shared tensors - **Group-Based Routing**: Optimized implementation of expert groups for more efficient routing - **Memory-Efficient Tensor Management**: Careful handling of temporary allocations - **Flexible Scoring Functions**: Support for both softmax and sigmoid routing - **Expert Load Balancing**: Runtime tracking of expert utilization - **Distributed Expert Sharding**: Support for distributing experts across multiple processes ### 3. Computation Backend Outlining the computation backend architecture for the DeepSeek-V3 project implemented in Zig. The design emphasizes performance, modularity, and hardware portability. #### 3.1 Backend Interface The backend interface provides a unified abstraction layer for all computation targets while maintaining Zig's zero-cost abstraction philosophy. 
```zig pub const ComputeBackend = struct { const Self = @This(); // Function pointers for backend-specific operations with proper type safety matmulFn: *const fn(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) anyerror!void, softmaxFn: *const fn(x: anytype, dim: usize, allocator: std.mem.Allocator) anyerror!void, rmsnormFn: *const fn(x: anytype, weight: anytype, eps: f32, allocator: std.mem.Allocator) anyerror!void, attentionFn: *const fn(q: anytype, k: anytype, v: anytype, mask: ?anytype, scale: f32, allocator: std.mem.Allocator) anyerror!void, // Other operations... // Configuration for the backend config: BackendConfig, // Dispatch based on backend type with proper error handling pub fn matmul(self: *const Self, a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void { return self.matmulFn(a, b, c, allocator); } pub fn softmax(self: *const Self, x: anytype, dim: usize, allocator: std.mem.Allocator) !void { return self.softmaxFn(x, dim, allocator); } pub fn rmsnorm(self: *const Self, x: anytype, weight: anytype, eps: f32, allocator: std.mem.Allocator) !void { return self.rmsnormFn(x, weight, eps, allocator); } pub fn attention(self: *const Self, q: anytype, k: anytype, v: anytype, mask: ?anytype, scale: f32, allocator: std.mem.Allocator) !void { return self.attentionFn(q, k, v, mask, scale, allocator); } }; ``` #### 3.2 Platform-Specific Implementations ```zig pub const CPUBackend = struct { allocator: std.mem.Allocator, thread_pool: ?*ThreadPool, pub fn init(allocator: std.mem.Allocator, thread_count: ?usize) !ComputeBackend { const thread_pool = if (thread_count) |count| { try ThreadPool.init(allocator, .{ .thread_count = count }); } else null; return ComputeBackend{ .matmulFn = cpuMatmul, .softmaxFn = cpuSoftmax, .rmsnormFn = cpuRmsnorm, .attentionFn = cpuAttention, // Other operations... .config = BackendConfig{ .backend_type = .Cpu, .max_threads = thread_count, // Other CPU-specific config... }, }; } fn cpuMatmul(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void { // Dynamically select the optimal implementation based on matrix dimensions and CPU features if (c.rows * c.cols > 1024 * 1024 and detectCpuFeatures().use_avx2) { return cpuMatmulParallel(a, b, c, allocator); } return cpuMatmulSIMD(a, b, c, allocator); } fn cpuSoftmax(x: anytype, dim: usize, allocator: std.mem.Allocator) !void { // Optimized CPU implementation using SIMD // Implementation details... } // Other CPU-specific implementations... }; pub const MetalBackend = struct { device: *MTLDevice, command_queue: *MTLCommandQueue, library: *MTLLibrary, allocator: std.mem.Allocator, pipelines: PipelineCache, pub fn init(allocator: std.mem.Allocator) !ComputeBackend { // Initialize Metal device, command queue, and library const device = MTLCreateSystemDefaultDevice() orelse return error.MetalDeviceNotAvailable; const command_queue = device.newCommandQueue() orelse return error.CommandQueueCreationFailed; // Load compute shaders from embedded metal code or compiled library const library = try loadDefaultLibrary(device); // Initialize pipeline cache var pipelines = PipelineCache.init(allocator); try pipelines.precompileEssentialPipelines(device, library); return ComputeBackend{ .matmulFn = metalMatmul, .softmaxFn = metalSoftmax, .rmsnormFn = metalRmsnorm, .attentionFn = metalAttention, // Other operations... .config = BackendConfig{ .backend_type = .Metal, .workgroup_size = .{16, 16, 1}, .shared_memory_size = 32 * 1024, // Other Metal-specific config... 
}, }; } fn metalMatmul(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void { // Implementation using Metal Performance Shaders when available // Fallback to custom compute kernel for specialized operations // Implementation details... } fn metalSoftmax(x: anytype, dim: usize, allocator: std.mem.Allocator) !void { // Metal implementation // Implementation details... } // Other Metal-specific implementations... }; ``` **Key Features:** - Abstract interface with compile-time type safety - Proper error handling with Zig's error system - Zero-cost abstraction for backend dispatch - Dynamic backend selection based on available hardware - Specialized implementations for different hardware architectures - Thread pool integration for CPU parallelism - Resource management for GPU backends - Pipeline caching for improved performance #### 3.3 SIMD Vectorization DeepSeek-V3 leverages Zig's built-in vector types to achieve high-performance computation across different architectures. ```zig // Define vector types with architecture-specific sizes pub fn VectorType(comptime T: type, comptime len: usize) type { return @Vector(len, T); } // Compile-time determination of optimal vector size pub fn getOptimalVectorSize(comptime T: type) usize { const target = @import("builtin").target; // Determine vector size based on architecture and data type if (T == f32) { if (target.cpu.arch == .x86_64 or target.cpu.arch == .x86) { if (target.cpu.features.isEnabled(.avx512f)) { return 16; // 512 bits / 32 bits = 16 elements } else if (target.cpu.features.isEnabled(.avx2)) { return 8; // 256 bits / 32 bits = 8 elements } else if (target.cpu.features.isEnabled(.sse4_1)) { return 4; // 128 bits / 32 bits = 4 elements } } else if (target.cpu.arch == .aarch64) { if (target.cpu.features.isEnabled(.neon)) { return 4; // 128 bits / 32 bits = 4 elements } } } else if (T == f16) { // Similar logic for f16 with doubled vector sizes // ... 
} // Default fallback return 4; } // Example of SIMD matrix multiplication pub fn matrixMultiplySIMD(comptime T: type, a: []const T, b: []const T, c: []T, m: usize, n: usize, k: usize) void { const vec_size = comptime getOptimalVectorSize(T); const Vec = VectorType(T, vec_size); // Process blocks that align with vector size const k_vec = k / vec_size * vec_size; for (0..m) |i| { for (0..n) |j| { var sum: T = 0; var vec_sum: Vec = @splat(0); // Vector part var kv: usize = 0; while (kv < k_vec) : (kv += vec_size) { const a_vec = blk: { var tmp: Vec = undefined; for (0..vec_size) |v| { tmp[v] = a[i * k + kv + v]; } break :blk tmp; }; const b_vec = blk: { var tmp: Vec = undefined; for (0..vec_size) |v| { tmp[v] = b[kv + v + j * k]; } break :blk tmp; }; vec_sum += a_vec * b_vec; } // Reduce vector for (0..vec_size) |v| { sum += vec_sum[v]; } // Remaining elements for (k_vec..k) |kk| { sum += a[i * k + kk] * b[kk + j * k]; } c[i * n + j] = sum; } } } ``` #### 3.4 Runtime CPU Feature Detection ```zig pub fn detectCpuFeatures() BackendConfig { var config = BackendConfig{ .backend_type = BackendType.Cpu, }; // Try to detect CPU features at runtime const cpu_info = std.zig.system.getCpuInfo() catch { // Fallback to safe defaults if detection fails return config; }; // Configure based on detected features config.use_avx512 = cpu_info.features.isEnabled(.avx512f); config.use_avx2 = cpu_info.features.isEnabled(.avx2); config.use_sse4_1 = cpu_info.features.isEnabled(.sse4_1); config.use_neon = cpu_info.features.isEnabled(.neon); return config; } ``` #### 3.5 Backend Configuration Backend configuration allows fine-tuning performance characteristics based on hardware capabilities and workload requirements. ```zig pub const BackendType = enum { Cpu, Cuda, Metal, Vulkan, WebGPU, }; pub const BackendConfig = struct { backend_type: BackendType, max_threads: ?usize = null, cache_line_size: usize = 64, // Default x86-64 cache line size use_avx512: bool = false, // Use AVX-512 when available use_avx2: bool = true, // Use AVX2 when available use_sse4_1: bool = true, // Use SSE4.1 when available use_neon: bool = false, // Use ARM NEON when available prefetch_distance: usize = 8, // Prefetch N cache lines ahead tiling_size: ?[2]usize = null, // Matrix tiling dimensions batch_size: ?usize = null, // Batch size for kernel operations memory_pool_size: ?usize = null, // Size of pre-allocated memory pool use_half_precision: bool = false, // Use FP16 where appropriate use_mixed_precision: bool = true, // Use mixed precision for matmul // GPU-specific options workgroup_size: ?[3]usize = null, // GPU workgroup dimensions shared_memory_size: ?usize = null, // GPU shared memory allocation compute_queue_depth: usize = 3, // Maximum concurrent compute operations }; ``` #### 3.6 GPU Integration DeepSeek-V3 supports multiple GPU backends, with specialized implementations for each platform. 
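As a sketch of how multi-backend support might be wired together at startup, the hypothetical helper below probes GPU backends in a fixed preference order and falls back to the CPU backend. The helper and its probe order are assumptions; `MetalBackend`, `CudaBackend`, `VulkanBackend`, and `CPUBackend` are the types described in this section:

```zig
const std = @import("std");
const builtin = @import("builtin");

// Hypothetical startup probe: prefer a GPU backend when one initializes
// successfully, otherwise fall back to the always-available CPU backend.
pub fn selectComputeBackend(allocator: std.mem.Allocator) !ComputeBackend {
    if (builtin.os.tag == .macos) {
        if (MetalBackend.init(allocator)) |backend| {
            return backend;
        } else |_| {}
    }
    if (CudaBackend.init(allocator, null)) |backend| {
        return backend;
    } else |_| {}
    if (VulkanBackend.init(allocator)) |backend| {
        return backend;
    } else |_| {}
    // CPU backend with automatic thread-count selection.
    return CPUBackend.init(allocator, null);
}
```

Which backends are compiled in at all could additionally be gated behind build options at comptime, so unsupported targets pay no code-size or dependency cost.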
#### 3.6.1 CUDA Backend ```zig pub const CudaBackend = struct { allocator: std.mem.Allocator, device: i32, stream: ?*anyopaque, handles: CudaHandles, module_cache: ModuleCache, pub fn init(allocator: std.mem.Allocator, device_id: ?i32) !ComputeBackend { // Initialize CUDA device, context, and stream const device = if (device_id) |id| id else try getOptimalCudaDevice(); try cudaSetDevice(device); var stream: ?*anyopaque = null; try checkCudaStatus(cudaStreamCreate(&stream)); // Initialize cuBLAS and cuDNN handles var handles = try CudaHandles.init(stream); // Compile and cache essential CUDA kernels var module_cache = try ModuleCache.init(allocator); try module_cache.compileEssentialKernels(); return ComputeBackend{ .matmulFn = cudaMatmul, .softmaxFn = cudaSoftmax, .rmsnormFn = cudaRmsnorm, .attentionFn = cudaAttention, // Other operations... .config = BackendConfig{ .backend_type = .Cuda, .workgroup_size = .{16, 16, 1}, .shared_memory_size = 48 * 1024, // Other CUDA-specific config... }, }; } fn cudaMatmul(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void { // Use cuBLAS for large matrices // Fall back to custom kernels for specialized operations // Implementation details... } // Other CUDA-specific implementations... }; ``` #### 3.6.2 Vulkan Backend ```zig pub const VulkanBackend = struct { allocator: std.mem.Allocator, instance: vk.Instance, physical_device: vk.PhysicalDevice, device: vk.Device, compute_queue: vk.Queue, command_pool: vk.CommandPool, pipeline_cache: vk.PipelineCache, shader_modules: ShaderModuleCache, pub fn init(allocator: std.mem.Allocator) !ComputeBackend { // Initialize Vulkan instance, device, and queues // Implementation details... return ComputeBackend{ .matmulFn = vulkanMatmul, .softmaxFn = vulkanSoftmax, .rmsnormFn = vulkanRmsnorm, .attentionFn = vulkanAttention, // Other operations... .config = BackendConfig{ .backend_type = .Vulkan, // Vulkan-specific config... }, }; } // Vulkan-specific implementations... }; ``` #### 3.7 Quantization Framework The quantization framework enables efficient model deployment through reduced precision arithmetic. 
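At its core, quantization is an affine mapping between real values and low-precision integers: `x_q = clamp(round(x / scale) + zero_point, qmin, qmax)`, reconstructed as `x ≈ scale * (x_q - zero_point)`. A minimal, self-contained sketch of the symmetric 8-bit case (zero_point = 0), written in the same style as the surrounding proposal code:

```zig
const std = @import("std");

// Simplified symmetric int8 quantization of a weight slice:
//   scale = max(|x|) / 127,  x_q = clamp(round(x / scale), -127, 127)
// Returns the scale needed for dequantization (x ≈ scale * x_q).
fn quantizeSymmetricInt8(values: []const f32, out: []i8) f32 {
    std.debug.assert(values.len == out.len);
    var abs_max: f32 = 0.0;
    for (values) |v| {
        abs_max = @max(abs_max, @abs(v));
    }
    const scale: f32 = if (abs_max == 0.0) 1.0 else abs_max / 127.0;
    for (values, 0..) |v, i| {
        const q = std.math.clamp(@round(v / scale), -127.0, 127.0);
        out[i] = @floatToInt(i8, q);
    }
    return scale;
}
```

For example, if the largest magnitude in a tensor is 1.27, the scale is 0.01, a weight of 0.42 quantizes to 42, and dequantizing yields 0.42 again; in general the round-trip error is bounded by half a quantization step (scale / 2). The framework below generalizes this idea with per-channel scales, asymmetric zero points, calibration strategies, and lower bit widths.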
```zig // Supported quantization methods pub const QuantizationMethod = enum { None, FP16, // Half precision Int8, // 8-bit integer quantization Int4, // 4-bit integer quantization NF4, // NormalFloat4 quantization GPTQ, // GPTQ quantization AWQ, // Activation-aware weight quantization }; // Quantization configuration pub const QuantConfig = struct { method: QuantizationMethod = .None, scale_type: ?type = null, // Type for quantization scales group_size: usize = 128, // Size of quantization groups bits: u8 = 8, // Bits per quantized value symmetric: bool = false, // Symmetric vs asymmetric quantization // Calibration parameters calibration_dataset: ?[]const u8 = null, num_calibration_samples: usize = 128, // Sparsity options use_sparse: bool = false, sparsity_threshold: f32 = 0.01, }; // Abstract quantizer interface pub const Quantizer = struct { const Self = @This(); quantizeFn: *const fn(self: *Self, tensor: Tensor, config: QuantConfig, allocator: std.mem.Allocator) anyerror!Tensor, dequantizeFn: *const fn(self: *Self, tensor: Tensor, allocator: std.mem.Allocator) anyerror!Tensor, pub fn quantize(self: *Self, tensor: Tensor, config: QuantConfig, allocator: std.mem.Allocator) !Tensor { return self.quantizeFn(self, tensor, config, allocator); } pub fn dequantize(self: *Self, tensor: Tensor, allocator: std.mem.Allocator) !Tensor { return self.dequantizeFn(self, tensor, allocator); } }; ``` #### 3.8 Memory Management Efficient memory management is crucial for large language model inference. ```zig // Memory allocation strategy pub const AllocStrategy = enum { Default, // Standard allocator Arena, // Arena allocator for bulk allocations Pool, // Memory pool for fixed-size allocations Streaming, // Streaming allocator for pipelined operations Pinned, // Pinned memory for efficient host-device transfers }; // Memory pool for efficient tensor allocations pub const TensorMemoryPool = struct { const Self = @This(); parent_allocator: std.mem.Allocator, pool: std.heap.MemoryPool, block_sizes: []const usize, blocks: std.AutoArrayHashMap(usize, std.ArrayList(*anyopaque)), mutex: std.Thread.Mutex, stats: MemoryStats, pub fn init(allocator: std.mem.Allocator, config: MemoryPoolConfig) !Self { // Initialize memory pool with predefined block sizes // Implementation details... } pub fn allocate(self: *Self, size: usize, alignment: usize) ![]u8 { // Find the appropriate block size or allocate directly // Implementation details... } pub fn free(self: *Self, ptr: []u8) void { // Return to pool or free directly // Implementation details... } // Memory management utilities pub fn preallocate(self: *Self, block_size: usize, count: usize) !void { // Preallocate multiple blocks of the specified size // Implementation details... } pub fn reclaim(self: *Self) void { // Reclaim unused memory blocks // Implementation details... } }; // Key-Value cache management for efficient inference pub const KVCache = struct { allocator: std.mem.Allocator, k_cache: Tensor, v_cache: Tensor, capacity: usize, size: usize, head_dim: usize, num_heads: usize, pub fn init(allocator: std.mem.Allocator, batch_size: usize, num_heads: usize, head_dim: usize, max_seq_len: usize) !Self { // Initialize key-value cache with appropriate dimensions // Implementation details... } pub fn append(self: *Self, k: Tensor, v: Tensor, pos: usize) !void { // Append new key-value pairs to the cache // Implementation details... 
} pub fn prefill(self: *Self, k: Tensor, v: Tensor) !void { // Prefill the cache with initial key-value pairs // Implementation details... } pub fn rotatePositions(self: *Self, positions: []const usize) !void { // Rearrange cache entries based on position IDs (for speculative decoding) // Implementation details... } pub fn clear(self: *Self) void { // Reset the cache size without deallocating memory // Implementation details... } }; ``` #### 3.9 Metal Integration for Apple Silicon Modern Apple Silicon devices offer exceptional compute performance, and our Zig implementation takes full advantage of these capabilities through direct Metal API integration: ```zig pub const MetalBackend = struct { const Self = @This(); // Core Metal resources device: *MTLDevice, command_queue: *MTLCommandQueue, library: *MTLLibrary, // Pipeline cache for reusing compiled compute pipelines pipeline_cache: std.AutoHashMap(u64, *MTLComputePipelineState), // Memory management allocator: std.mem.Allocator, buffer_pool: BufferPool, // Configuration and statistics config: BackendConfig, stats: MetalStatistics, pub fn init(allocator: std.mem.Allocator) !*Self { // Get the default Metal device var device = MTLCreateSystemDefaultDevice(); if (device == null) return error.MetalDeviceNotAvailable; // Create a command queue for submitting work to the GPU var command_queue = device.?.newCommandQueue(); if (command_queue == null) return error.MetalCommandQueueCreationFailed; // Compile our Metal shader library from source or load precompiled metallib var library: ?*MTLLibrary = null; if (comptime @import("builtin").mode == .Debug) { // Compile from source for easier debugging library = try compileLibraryFromSource(device.?, shader_source); } else { // Use precompiled metallib for release builds const metallib_path = try findMetalLibPath(allocator); defer allocator.free(metallib_path); library = try loadCompiledLibrary(device.?, metallib_path); } // Create the Metal backend var self = try allocator.create(Self); errdefer allocator.destroy(self); // Initialize the pipeline cache var pipeline_cache = std.AutoHashMap(u64, *MTLComputePipelineState).init(allocator); errdefer pipeline_cache.deinit(); // Initialize the buffer pool for efficient memory reuse var buffer_pool = try BufferPool.init(allocator, device.?); errdefer buffer_pool.deinit(); // Get optimal configuration based on the device capabilities var config = try getMetalOptimalConfig(device.?); self.* = .{ .device = device.?, .command_queue = command_queue.?, .library = library.?, .pipeline_cache = pipeline_cache, .allocator = allocator, .buffer_pool = buffer_pool, .config = config, .stats = MetalStatistics.init(), }; return self; } pub fn deinit(self: *Self) void { // Release all cached pipelines var it = self.pipeline_cache.valueIterator(); while (it.next()) |pipeline| { pipeline.*.release(); } self.pipeline_cache.deinit(); // Clean up buffer pool self.buffer_pool.deinit(); // Release Metal resources self.library.release(); self.command_queue.release(); self.device.release(); // Free memory self.allocator.destroy(self); } // Get or create a compute pipeline for a function pub fn getPipeline(self: *Self, function_name: []const u8) !*MTLComputePipelineState { // Hash the function name for quick lookup const hash = std.hash.CityHash64.hash(function_name); // Check if we already have a cached pipeline if (self.pipeline_cache.get(hash)) |pipeline| { return pipeline; } // Create a new pipeline if not found var function = self.library.newFunctionWithName(function_name); if 
(function == null) return error.MetalFunctionNotFound; defer function.?.release(); // Create the compute pipeline var pipeline_desc = MTLComputePipelineDescriptor.alloc().init(); defer pipeline_desc.release(); pipeline_desc.setComputeFunction(function.?); // Enable buffer mutability tracking in debug mode if (comptime @import("builtin").mode == .Debug) { pipeline_desc.setMutabilityOptions(.{ .MTLPipelineBufferMutabilityAccessTracking = true, }); } // Enable threadgroup memory length optimization pipeline_desc.setThreadGroupSizeIsMultipleOfThreadExecutionWidth(true); // Create the pipeline state var error_ptr: ?*NSError = null; var pipeline = self.device.newComputePipelineStateWithDescriptor( pipeline_desc, .MTLPipelineOptionArgumentInfo, null, &error_ptr ); if (pipeline == null) { if (error_ptr != null) { // Log the error details const error_str = error_ptr.?.localizedDescription().UTF8String(); std.log.err("Failed to create pipeline for {s}: {s}", .{ function_name, error_str, }); error_ptr.?.release(); } return error.MetalPipelineCreationFailed; } // Cache the pipeline for future use try self.pipeline_cache.put(hash, pipeline.?); return pipeline.?; } // Execute a compute kernel with the given parameters pub fn executeKernel( self: *Self, kernel_name: []const u8, grid_size: [3]u32, block_size: [3]u32, buffers: []const MetalBuffer, wait_until_completed: bool, ) !void { // Get the pipeline for this kernel var pipeline = try self.getPipeline(kernel_name); // Create a command buffer var command_buffer = self.command_queue.commandBuffer(); if (command_buffer == null) return error.MetalCommandBufferCreationFailed; // Create a compute command encoder var encoder = command_buffer.?.computeCommandEncoder(); if (encoder == null) return error.MetalComputeEncoderCreationFailed; // Set the compute pipeline encoder.?.setComputePipelineState(pipeline); // Bind buffers for (buffers, 0..) 
|buffer, i| { encoder.?.setBuffer(buffer.handle, buffer.offset, @intCast(i)); } // Calculate threadgroup size var threadgroup_size = MTLSize{ .width = block_size[0], .height = block_size[1], .depth = block_size[2], }; // Calculate grid size var grid = MTLSize{ .width = grid_size[0], .height = grid_size[1], .depth = grid_size[2], }; // Dispatch the compute work encoder.?.dispatchThreadgroups(grid, threadgroup_size); // End encoding encoder.?.endEncoding(); // Commit the command buffer command_buffer.?.commit(); // Wait for completion if requested if (wait_until_completed) { command_buffer.?.waitUntilCompleted(); } // Update statistics self.stats.kernel_executions += 1; } // Create a buffer and copy data to it pub fn createBuffer( self: *Self, data: []const u8, options: MTLResourceOptions, ) !*MTLBuffer { // Get a buffer from the pool or create a new one var buffer = try self.buffer_pool.getBuffer(data.len, options); // Copy data to the buffer @memcpy(buffer.contents()[0..data.len], data); return buffer; } // Create a tensor in Metal memory pub fn createTensor(self: *Self, tensor: Tensor(f32, 2)) !MetalTensor { // Calculate size in bytes const size_bytes = tensor.data.len * @sizeOf(f32); // Create a buffer var buffer = try self.createBuffer( @ptrCast([*]const u8, tensor.data.ptr)[0..size_bytes], .StorageModeShared ); return MetalTensor{ .buffer = buffer, .shape = tensor.shape, .element_type = .f32, }; } // Example implementation of matrix multiplication using Metal pub fn matmul( self: *Self, a: Tensor(f32, 2), b: Tensor(f32, 2), ) !Tensor(f32, 2) { // Validate dimensions std.debug.assert(a.shape[1] == b.shape[0], "Incompatible matrix dimensions"); const m = a.shape[0]; const k = a.shape[1]; const n = b.shape[1]; // Create result tensor var result = try Tensor(f32, 2).init(self.allocator, .{m, n}); errdefer result.deinit(); // Create Metal tensors var a_metal = try self.createTensor(a); defer a_metal.buffer.release(); var b_metal = try self.createTensor(b); defer b_metal.buffer.release(); var result_metal = try self.createTensor(result); defer result_metal.buffer.release(); // Create dimension buffer const dims = [_]u32{@intCast(m), @intCast(k), @intCast(n)}; var dims_buffer = try self.createBuffer( @ptrCast([*]const u8, &dims)[0..dims.len * @sizeOf(u32)], .StorageModeShared ); defer dims_buffer.release(); // Set up buffers const buffers = [_]MetalBuffer{ .{ .handle = a_metal.buffer, .offset = 0 }, .{ .handle = b_metal.buffer, .offset = 0 }, .{ .handle = result_metal.buffer, .offset = 0 }, .{ .handle = dims_buffer, .offset = 0 }, }; // Calculate optimal workgroup size const workgroup_size: [3]u32 = if (self.config.workgroup_size) |ws| .{ @intCast(ws[0]), @intCast(ws[1]), 1 } else .{ 16, 16, 1 }; // Calculate grid size const grid_size: [3]u32 = .{ (n + workgroup_size[0] - 1) / workgroup_size[0], (m + workgroup_size[1] - 1) / workgroup_size[1], 1, }; // Execute the kernel try self.executeKernel( "matmul", grid_size, workgroup_size, &buffers, true ); // Copy data back from Metal @memcpy( result.data, @ptrCast([*]const f32, result_metal.buffer.contents())[0..result.data.len] ); return result; } }; // Efficient buffer pooling to avoid frequent allocations pub const BufferPool = struct { const Self = @This(); allocator: std.mem.Allocator, device: *MTLDevice, free_buffers: std.AutoHashMap(u64, std.ArrayList(*MTLBuffer)), pub fn init(allocator: std.mem.Allocator, device: *MTLDevice) !Self { return Self{ .allocator = allocator, .device = device, .free_buffers = std.AutoHashMap(u64, 
std.ArrayList(*MTLBuffer)).init(allocator), }; } pub fn deinit(self: *Self) void { // Release all buffers var it = self.free_buffers.valueIterator(); while (it.next()) |buffer_list| { for (buffer_list.items) |buffer| { buffer.release(); } buffer_list.deinit(); } self.free_buffers.deinit(); } // Get a buffer of at least the requested size pub fn getBuffer(self: *Self, size: usize, options: MTLResourceOptions) !*MTLBuffer { // Round up to power of 2 for better reuse const aligned_size = nextPowerOfTwo(size); // Check if we have a free buffer of appropriate size if (self.free_buffers.getPtr(aligned_size)) |buffer_list| { if (buffer_list.items.len > 0) { // Reuse an existing buffer return buffer_list.pop(); } } // Create a new buffer if none available var buffer = self.device.newBufferWithLength(aligned_size, options); if (buffer == null) return error.MetalBufferAllocationFailed; return buffer.?; } // Return a buffer to the pool for reuse pub fn releaseBuffer(self: *Self, buffer: *MTLBuffer) !void { const size = buffer.length(); const aligned_size = nextPowerOfTwo(size); // Add to the appropriate size list if (self.free_buffers.getPtr(aligned_size)) |buffer_list| { try buffer_list.append(buffer); } else { // Create a new list if this is the first buffer of this size var buffer_list = std.ArrayList(*MTLBuffer).init(self.allocator); try buffer_list.append(buffer); try self.free_buffers.put(aligned_size, buffer_list); } } // Utility to find next power of two fn nextPowerOfTwo(n: usize) usize { var v = n; v -= 1; v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; v |= v >> 32; v += 1; return v; } }; // Representation of a tensor in Metal memory pub const MetalTensor = struct { buffer: *MTLBuffer, shape: []const usize, element_type: enum { f16, f32, }, }; // Helper for buffer binding pub const MetalBuffer = struct { handle: *MTLBuffer, offset: u64 = 0, }; // Statistics for performance monitoring pub const MetalStatistics = struct { kernel_executions: usize = 0, bytes_transferred: usize = 0, peak_memory_usage: usize = 0, pub fn init() MetalStatistics { return .{}; } }; // Example Metal shader source for matrix multiplication const shader_source = \\#include \\using namespace metal; \\ \\kernel void matmul( \\ const device float* a [[buffer(0)]], \\ const device float* b [[buffer(1)]], \\ device float* result [[buffer(2)]], \\ const device uint* dims [[buffer(3)]], \\ uint2 gid [[thread_position_in_grid]], \\ uint2 lid [[thread_position_in_threadgroup]], \\ uint2 lsize [[threads_per_threadgroup]]) \\{ \\ const uint m = dims[0]; \\ const uint k = dims[1]; \\ const uint n = dims[2]; \\ \\ // Check if within bounds \\ if (gid.x >= n || gid.y >= m) return; \\ \\ // Calculate result[gid.y][gid.x] \\ float sum = 0.0f; \\ for (uint i = 0; i < k; i++) { \\ sum += a[gid.y * k + i] * b[i * n + gid.x]; \\ } \\ \\ result[gid.y * n + gid.x] = sum; \\} \\ \\kernel void matmul_optimized( \\ const device float* a [[buffer(0)]], \\ const device float* b [[buffer(1)]], \\ device float* result [[buffer(2)]], \\ const device uint* dims [[buffer(3)]], \\ uint2 gid [[thread_position_in_grid]], \\ uint2 lid [[thread_position_in_threadgroup]], \\ uint2 lsize [[threads_per_threadgroup]]) \\{ \\ const uint m = dims[0]; \\ const uint k = dims[1]; \\ const uint n = dims[2]; \\ \\ // Check if within bounds \\ if (gid.x >= n || gid.y >= m) return; \\ \\ // Use threadgroup memory for caching \\ threadgroup float a_cache[16][16]; \\ threadgroup float b_cache[16][16]; \\ \\ float sum = 0.0f; \\ \\ // Process in tiles 
\\ for (uint tile = 0; tile < (k + 15) / 16; tile++) { \\ // Load a tile into threadgroup memory \\ const uint tile_idx = tile * 16; \\ \\ if (tile_idx + lid.x < k && gid.y < m) { \\ a_cache[lid.y][lid.x] = a[gid.y * k + tile_idx + lid.x]; \\ } else { \\ a_cache[lid.y][lid.x] = 0.0f; \\ } \\ \\ if (tile_idx + lid.y < k && gid.x < n) { \\ b_cache[lid.y][lid.x] = b[(tile_idx + lid.y) * n + gid.x]; \\ } else { \\ b_cache[lid.y][lid.x] = 0.0f; \\ } \\ \\ // Wait for all threads to load data \\ threadgroup_barrier(mem_flags::mem_threadgroup); \\ \\ // Compute partial dot product for this tile \\ for (uint i = 0; i < 16; i++) { \\ sum += a_cache[lid.y][i] * b_cache[i][lid.x]; \\ } \\ \\ // Wait for all threads to finish using the cached data \\ threadgroup_barrier(mem_flags::mem_threadgroup); \\ } \\ \\ // Write result \\ if (gid.x < n && gid.y < m) { \\ result[gid.y * n + gid.x] = sum; \\ } \\} ; ``` **Apple-Specific Optimizations:** 1. **Metal Shader Integration** - Direct compilation of Metal shaders from Zig source code - Runtime shader compilation in debug mode for easier iteration - Precompiled metallib loading for optimized release builds 2. **Memory Management** - Buffer pooling to minimize allocations and deallocations - Shared memory mode for zero-copy between CPU and GPU - Explicit control over resource storage options 3. **Performance Optimizations** - Tile-based computation for optimal cache utilization - Threadgroup memory usage for shared data access - Work distribution based on detected GPU characteristics - Pipeline state caching for faster kernel dispatching 4. **AMX Acceleration** - Support for Apple Matrix extensions (AMX) - Specialized matrix multiplication operations for M-series chips - Custom shader variants optimized for different Apple Silicon generations 5. **Neural Engine Integration** - Optional ANE (Apple Neural Engine) offloading for supported operations - Hybrid execution strategies combining GPU and Neural Engine - Automatic fallback to Metal for unsupported operations ### 4. Inference Pipeline The inference pipeline is the core execution flow for running the DeepSeek V3 model. Our Zig implementation focuses on efficiency, flexibility, and streaming capabilities. 
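Before diving into the individual stages, a high-level sketch of how they might fit together. This is a hypothetical driver, not part of the proposed API: `TransformerModel.loadFromPath`, `GenerationConfig`, and `generate` are defined in the subsections below, while `Tokenizer.encode` and the exact streaming wiring shown here are assumptions:

```zig
const std = @import("std");

// Stream each decoded token fragment to stdout as it is produced.
fn printToken(text: []const u8) void {
    std.io.getStdOut().writer().writeAll(text) catch {};
}

// Hypothetical end-to-end flow: load weights, tokenize the prompt, then run
// the streaming generation loop from section 4.2.
pub fn runInference(
    allocator: std.mem.Allocator,
    model_path: []const u8,
    prompt: []const u8,
) !void {
    const model = try TransformerModel.loadFromPath(allocator, model_path, null);
    defer model.destroy();

    // `encode` is assumed here; the tokenizer API is detailed later.
    const input_ids = try model.tokenizer.encode(allocator, prompt);
    defer allocator.free(input_ids);

    const output_ids = try generate(model, input_ids, .{
        .max_new_tokens = 256,
        .temperature = 0.7,
        .top_p = 0.9,
    }, printToken);
    defer allocator.free(output_ids);
}
```

The rest of this section fills in each stage: loading and validating weights (4.1) and the sampling and streaming strategies used during generation (4.2).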
#### 4.1 Model Loading ```zig // The ModelLoader handles loading and initializing DeepSeek V3 models pub const ModelLoader = struct { const Self = @This(); allocator: std.mem.Allocator, config: LoaderConfig, // Configuration for model loading pub const LoaderConfig = struct { // Number of threads to use for weight loading loading_threads: ?usize = null, // Optional cache directory for model weights cache_dir: ?[]const u8 = null, // How to handle safetensors format safetensors_memory_map: bool = true, // Validation level for loaded weights validation: enum { none, basic, full } = .basic, // Device to place model on after loading target_device: BackendType = .Cpu, }; pub fn init(allocator: std.mem.Allocator, config: LoaderConfig) Self { return .{ .allocator = allocator, .config = config, }; } // Load a model from file pub fn loadModel( self: *Self, path: []const u8, model_args: ?ModelArgs, ) !*TransformerModel { const extension = std.fs.path.extension(path); // Determine model format from file extension if (std.mem.eql(u8, extension, ".safetensors")) { return try self.loadFromSafetensors(path, model_args); } else if (std.mem.eql(u8, extension, ".ckpt")) { return try self.loadFromCheckpoint(path, model_args); } else if (std.mem.eql(u8, extension, ".bin")) { return try self.loadFromBinary(path, model_args); } else if (std.fs.cwd().accessZ(path, .{}) == .AccessDenied) { // Could be a Hugging Face model ID, try to download it return try self.loadFromHuggingFace(path, model_args); } return error.UnsupportedModelFormat; } // Load model from SafeTensors format (optimized for memory mapping) fn loadFromSafetensors( self: *Self, path: []const u8, model_args: ?ModelArgs, ) !*TransformerModel { // Open the safetensors file var file = try std.fs.cwd().openFile(path, .{}); defer file.close(); // Memory map the file for zero-copy access if configured if (self.config.safetensors_memory_map) { const file_size = try file.getEndPos(); // Memory map the file const mapped_memory = try std.os.mmap( null, file_size, std.os.PROT.READ, std.os.MAP.PRIVATE, file.handle, 0, ); // Process the memory-mapped safetensors return try self.processSafetensorsMemoryMapped( mapped_memory, file_size, model_args, ); } else { // If memory mapping is disabled, read the file conventionally return try self.processSafetensorsFile(file, model_args); } } // Process a memory-mapped SafeTensors file fn processSafetensorsMemoryMapped( self: *Self, memory: []const u8, file_size: usize, model_args: ?ModelArgs, ) !*TransformerModel { // Parse the header which contains tensor metadata const header_size = std.mem.readIntLittle(u64, memory[0..8]); const header_json = memory[8..8+header_size]; // Parse the JSON header var parsed = try std.json.parseFromSlice( std.json.Value, self.allocator, header_json, .{}, ); defer parsed.deinit(); // Get the model configuration from arguments or try to infer it const args = try self.determineModelArgs(model_args, parsed.value); // Create the model with the determined configuration var model = try TransformerModel.create(self.allocator, args); errdefer model.destroy(); // Create a tensor mapping for zero-copy loading try self.loadTensorsFromSafetensorsMemory( model, memory, header_size, parsed.value, ); // Validate the loaded model if configured if (self.config.validation != .none) { try self.validateModel(model, parsed.value); } return model; } // Load a model from Hugging Face fn loadFromHuggingFace( self: *Self, model_id: []const u8, model_args: ?ModelArgs, ) !*TransformerModel { // Get cache directory or 
create a temporary one const cache_dir = self.config.cache_dir orelse try std.fs.getAppDataDir(self.allocator, "deepseek-zig"); // Create HF client var hf_client = try HuggingFaceClient.init(self.allocator, cache_dir); defer hf_client.deinit(); // Download the model const model_path = try hf_client.downloadModel(model_id); // Load the downloaded model return try self.loadModel(model_path, model_args); } // Infer model arguments if not explicitly provided fn determineModelArgs( self: *Self, model_args: ?ModelArgs, header: std.json.Value, ) !ModelArgs { if (model_args) |args| { return args; } // Try to infer model configuration from the weight shapes if (header.Object.get("metadata")) |metadata| { if (metadata.Object.get("model_type")) |model_type| { if (std.mem.eql(u8, model_type.String, "deepseek")) { // Extract dimensions from metadata return try self.parseDeepSeekConfig(metadata); } } } // Infer from weight shapes if metadata is not available return try self.inferArgsFromWeights(header); } // ... more implementation details ... }; // Implementation of TransformerModel pub const TransformerModel = struct { const Self = @This(); allocator: std.mem.Allocator, args: ModelArgs, // Tokenizer for text processing tokenizer: *Tokenizer, // Model components embedding: *Embedding, layers: []TransformerLayer, norm: *LayerNorm, lm_head: *Linear, // KV cache for efficient inference kv_cache: ?*KVCache, // Backend for computation backend: *ComputeBackend, // Create a model with the given configuration pub fn create( allocator: std.mem.Allocator, args: ModelArgs, ) !*Self { // Create model components var embedding = try Embedding.create(allocator, args); errdefer embedding.destroy(); var layers = try allocator.alloc(TransformerLayer, args.num_layers); errdefer allocator.free(layers); for (layers, 0..) |*layer, i| { layer.* = try TransformerLayer.create(allocator, args, i); } var norm = try LayerNorm.create(allocator, args.dim); errdefer norm.destroy(); var lm_head = try Linear.create(allocator, args.dim, args.vocab_size); errdefer lm_head.destroy(); // Initialize compute backend var backend = try ComputeBackend.create(allocator); errdefer backend.destroy(); // Initialize tokenizer var tokenizer = try Tokenizer.create(allocator, args.vocab_size); errdefer tokenizer.destroy(); // Create the model var model = try allocator.create(Self); errdefer allocator.destroy(model); model.* = .{ .allocator = allocator, .args = args, .tokenizer = tokenizer, .embedding = embedding, .layers = layers, .norm = norm, .lm_head = lm_head, .kv_cache = null, .backend = backend, }; return model; } // Clean up resources pub fn destroy(self: *Self) void { // Free all components self.tokenizer.destroy(); self.embedding.destroy(); for (self.layers) |*layer| { layer.deinit(); } self.allocator.free(self.layers); self.norm.destroy(); self.lm_head.destroy(); if (self.kv_cache) |kv_cache| { kv_cache.destroy(); } self.backend.destroy(); self.allocator.destroy(self); } // Load a model from a specific path pub fn loadFromPath( allocator: std.mem.Allocator, path: []const u8, args: ?ModelArgs, ) !*Self { var loader = ModelLoader.init(allocator, .{}); return try loader.loadModel(path, args); } // Forward pass for a single token pub fn forward( self: *Self, token_id: usize, position: usize, ) !Tensor(f32, 2) { // Get the token embedding var x = try self.embedding.forward(token_id); // Process through all transformer layers for (self.layers, 0..) 
|*layer, i| { x = try layer.forward(x, position, self.kv_cache); } // Apply final layer norm x = try self.norm.forward(x); // Project to vocabulary return try self.lm_head.forward(x); } // Prepare the model for generation pub fn prepareForGeneration( self: *Self, max_seq_len: usize, batch_size: usize, ) !void { // Create KV cache if not already created if (self.kv_cache == null) { self.kv_cache = try KVCache.create( self.allocator, self.args, max_seq_len, batch_size, ); } else { // Reset the cache if it already exists try self.kv_cache.?.reset(max_seq_len, batch_size); } } // Load tokenizer from vocabulary file pub fn loadTokenizer( self: *Self, path: []const u8, ) !void { try self.tokenizer.loadFromFile(path); } }; ``` #### 4.2 Generation Strategies ```zig // Configuration for text generation pub const GenerationConfig = struct { // Maximum new tokens to generate max_new_tokens: usize = 128, // Sampling temperature (higher = more random) temperature: f32 = 1.0, // Top-p sampling parameter (0.0-1.0) top_p: f32 = 1.0, // Top-k sampling parameter (0 = disabled) top_k: usize = 0, // Repetition penalty to prevent looping repetition_penalty: f32 = 1.0, // Whether to use sampling or greedy decoding do_sample: bool = true, // Frequency penalty for repeated tokens frequency_penalty: f32 = 0.0, // Presence penalty for token occurrence presence_penalty: f32 = 0.0, // Stop sequences to terminate generation stop_sequences: ?[]const []const u8 = null, // Minimum number of tokens to generate min_new_tokens: ?usize = null, // Beam search width (1 = greedy) num_beams: usize = 1, // Random seed for reproducibility seed: ?u64 = null, // Whether to use speculative decoding use_speculative: bool = false, // Draft model for speculative decoding draft_model: ?*TransformerModel = null, // Number of speculative tokens to generate at once speculative_tokens: usize = 5, }; // Generate text from a model given input tokens pub fn generate( model: *TransformerModel, input_ids: []const usize, config: GenerationConfig, callback: ?fn ([]const u8) void, ) ![]usize { // Initialize RNG with seed if provided var rng = if (config.seed) |seed| std.rand.DefaultPrng.init(seed) else std.rand.DefaultPrng.init(@bitCast(u64, std.time.milliTimestamp())); // Allocate result buffer var result = try model.allocator.alloc( usize, input_ids.len + config.max_new_tokens, ); errdefer model.allocator.free(result); // Copy input tokens @memcpy(result[0..input_ids.len], input_ids); var token_count = input_ids.len; // Prepare model for generation try model.prepareForGeneration( input_ids.len + config.max_new_tokens, 1, // Batch size ); // Process all input tokens to fill KV cache var position: usize = 0; for (input_ids) |token_id| { _ = try model.forward(token_id, position); position += 1; } // Check if we should use speculative decoding if (config.use_speculative and config.draft_model != null) { return try speculativeGenerate( model, config.draft_model.?, result, token_count, position, config, callback, ); } // Set up logit processors based on config var logit_processors = LogitProcessorList.init(model.allocator); defer logit_processors.deinit(); if (config.temperature != 1.0) { try logit_processors.add(TemperatureLogitProcessor.init(config.temperature)); } if (config.repetition_penalty != 1.0) { try logit_processors.add(RepetitionPenaltyLogitProcessor.init( config.repetition_penalty, result[0..token_count], )); } if (config.frequency_penalty != 0.0 or config.presence_penalty != 0.0) { try 
logit_processors.add(FrequencyPenaltyLogitProcessor.init( config.frequency_penalty, config.presence_penalty, )); } // Main generation loop while (token_count < result.len) { // Get next token logits var logits = try model.forward(result[token_count - 1], position); defer logits.deinit(); // Apply logit processors try logit_processors.process(&logits, result[0..token_count]); // Sample next token const next_token = if (config.do_sample) try sampleNextToken( model.allocator, logits, config.top_p, config.top_k, &rng.random(), ) else try greedyNextToken(logits); // Add token to result result[token_count] = next_token; token_count += 1; position += 1; // Check for stop sequences if (config.stop_sequences) |stop_seqs| { if (checkStopSequences( model.tokenizer, result[0..token_count], stop_seqs, )) { break; } } // Call callback with generated token if provided if (callback != null) { var token_text = try model.tokenizer.decodeTokens( model.allocator, result[token_count-1..token_count], ); defer model.allocator.free(token_text); callback.?(token_text); } // Check if we've reached minimum token count if (config.min_new_tokens) |min_tokens| { if (token_count >= input_ids.len + min_tokens) { // Check if we're at an EOS token if (next_token == model.tokenizer.eos_token_id) { break; } } } else if (next_token == model.tokenizer.eos_token_id) { // Otherwise just stop at EOS break; } } // Resize result to actual number of tokens result = try model.allocator.realloc(result, token_count); return result; } // Speculative decoding implementation fn speculativeGenerate( model: *TransformerModel, draft_model: *TransformerModel, result: []usize, token_count: usize, position: usize, config: GenerationConfig, callback: ?fn ([]const u8) void, ) ![]usize { // Implementation of speculative decoding algorithm // This generates multiple tokens using a smaller draft model // and verifies them with the main model for faster generation // ... implementation details ... return result; } // Sample next token using top-p (nucleus) and top-k sampling fn sampleNextToken( allocator: std.mem.Allocator, logits: Tensor(f32, 2), top_p: f32, top_k: usize, random: *std.rand.Random, ) !usize { const vocab_size = logits.shape[1]; // Create a sorted list of (token_id, probability) pairs var token_probs = try allocator.alloc( struct { token_id: usize, prob: f32 }, vocab_size, ); defer allocator.free(token_probs); // Apply softmax to get probabilities var probs = try softmax(allocator, logits); defer probs.deinit(); // Fill token_probs array for (0..vocab_size) |i| { token_probs[i] = .{ .token_id = i, .prob = probs.data[i], }; } // Sort by probability (descending) std.sort.sort( struct { token_id: usize, prob: f32 }, token_probs, {}, struct { fn lessThan(_: void, a: struct { token_id: usize, prob: f32 }, b: struct { token_id: usize, prob: f32 }) bool { return b.prob < a.prob; } }.lessThan, ); // Apply top-k filtering if enabled const k = if (top_k > 0) @min(top_k, vocab_size) else vocab_size; // Apply top-p filtering var cumulative_prob: f32 = 0.0; var last_idx: usize = 0; for (token_probs[0..k], 0..) |tp, i| { cumulative_prob += tp.prob; if (cumulative_prob >= top_p) { last_idx = i; break; } } // Sample from the filtered distribution const rand_val = random.float(f32); var curr_prob: f32 = 0.0; for (token_probs[0..last_idx+1]) |tp| { curr_prob += tp.prob; if (rand_val < curr_prob) { return tp.token_id; } } // Fallback to the highest probability token return token_probs[0].token_id; } ``` **Advanced Features:** 1. 
**Speculative Decoding** - Implementation of speculative decoding using a smaller draft model - Verification and acceptance/rejection of speculated tokens - Significant speedup in generation throughput 2. **Streaming Token Output** - Callback-based token streaming for real-time results - Zero-copy token decoding for minimal overhead - Support for incremental UI updates 3. **Custom Sampling Strategies** - Top-p (nucleus) sampling with dynamic probability mass cutoff - Top-k sampling with configurable k value - Temperature scaling for controlling randomness - Repetition penalty to prevent loops and repetitive text - Frequency and presence penalties for more diverse output 4. **Stop Sequence Detection** - Efficient detection of multiple stop sequences - Support for subword token matching across boundaries - Early termination based on generated content 5. **Beam Search Implementation** - Configurable beam width for exploring multiple generation paths - Length normalization for balancing short and long outputs - Diverse beam groups to prevent similar outputs 6. **Memory Efficiency** - KV-cache memory management for long context handling - Incremental cache updates for streaming inference - Automatic cache pruning for memory optimization 7. **Performance Optimizations** - Batched token processing for higher throughput - Parallel sampling for multi-sequence generation - SIMD-accelerated logit processing - Compile-time specialization for common configuration patterns ### 5. Optimization Layer The optimization layer leverages Zig's unique features to maximise performance across different hardware targets. #### 5.1 Compile-Time Optimizations Zig's powerful compile-time metaprogramming enables us to generate highly specialized code for specific hardware and model configurations: ```zig // Specialized matrix multiplication kernels generated at compile-time pub fn generateMatmulKernel(comptime config: KernelConfig) type { return struct { const Self = @This(); // Compile-time configuration const M = config.M; const N = config.N; const K = config.K; const block_size = config.block_size; const vector_width = config.vector_width; const use_fma = config.use_fma; // Vector type based on configuration const Vec = @Vector(vector_width, f32); // Matmul implementation specialized for the given dimensions pub fn matmul( a: *const [M][K]f32, b: *const [K][N]f32, c: *[M][N]f32, ) void { // Use specialized implementation for small matrices if (comptime M <= 4 and N <= 4 and K <= 4) { return smallMatmul(a, b, c); } // Use blocked implementation for larger matrices return blockedMatmul(a, b, c); } // Specialized implementation for small matrices // Fully unrolled at compile time fn smallMatmul( a: *const [M][K]f32, b: *const [K][N]f32, c: *[M][N]f32, ) void { inline for (0..M) |i| { inline for (0..N) |j| { var sum: f32 = 0; inline for (0..K) |k| { sum += a[i][k] * b[k][j]; } c[i][j] = sum; } } } // Cache-blocked implementation for larger matrices fn blockedMatmul( a: *const [M][K]f32, b: *const [K][N]f32, c: *[M][N]f32, ) void { // Compute using blocks for better cache utilization comptime var i_block: usize = 0; inline while (i_block < M) : (i_block += block_size) { comptime var j_block: usize = 0; inline while (j_block < N) : (j_block += block_size) { comptime var k_block: usize = 0; inline while (k_block < K) : (k_block += block_size) { const i_end = @min(i_block + block_size, M); const j_end = @min(j_block + block_size, N); const k_end = @min(k_block + block_size, K); // Process current block for (i_block..i_end) 
|i| { for (j_block..j_end) |j| { var sum: f32 = c[i][j]; // Vectorized inner loop when possible if (comptime vector_width > 1 and (k_end - k_block) >= vector_width) { var k_vec: usize = k_block; var acc: Vec = @splat(0.0); while (k_vec + vector_width <= k_end) : (k_vec += vector_width) { const a_vec: Vec = blk: { var tmp: [vector_width]f32 = undefined; for (0..vector_width) |vi| { tmp[vi] = a[i][k_vec + vi]; } break :blk tmp; }; const b_vec: Vec = blk: { var tmp: [vector_width]f32 = undefined; for (0..vector_width) |vi| { tmp[vi] = b[k_vec + vi][j]; } break :blk tmp; }; // Use FMA instruction if available if (comptime use_fma) { acc = @mulAdd(Vec, a_vec, b_vec, acc); } else { acc += a_vec * b_vec; } } // Reduce vector to scalar for (0..vector_width) |vi| { sum += acc[vi]; } // Handle remaining elements for (k_vec..k_end) |k| { sum += a[i][k] * b[k][j]; } } else { // Scalar fallback for (k_block..k_end) |k| { sum += a[i][k] * b[k][j]; } } c[i][j] = sum; } } } } } } }; } // Configuration for kernel generation pub const KernelConfig = struct { // Matrix dimensions (can be comptime_int or dynamic) M: comptime_int, N: comptime_int, K: comptime_int, // Blocking configuration for cache optimization block_size: comptime_int = 32, // Vector width for SIMD operations vector_width: comptime_int = 4, // Whether to use FMA instructions when available use_fma: bool = true, }; // Usage: Create specialized kernels at compile time // Fully unrolled 4x4 matrix multiplication const Kernel4x4 = generateMatmulKernel(.{ .M = 4, .N = 4, .K = 4, .vector_width = 4, }); // Cache-friendly 128x128 matrix multiplication const Kernel128x128 = generateMatmulKernel(.{ .M = 128, .N = 128, .K = 128, .block_size = 32, .vector_width = 8, }); // Runtime dispatch to select the best kernel based on matrix dimensions pub fn dispatchMatmul( allocator: std.mem.Allocator, a: Tensor(f32, 2), b: Tensor(f32, 2), ) !Tensor(f32, 2) { // Check dimensions const m = a.shape[0]; const k = a.shape[1]; const n = b.shape[1]; std.debug.assert(k == b.shape[0], "Incompatible matrix dimensions"); // Create result tensor var result = try Tensor(f32, 2).init(allocator, .{m, n}); errdefer result.deinit(); // Initialize result to zeros @memset(result.data, 0); // Dispatch to specialized kernels if dimensions match exactly if (m == 4 and n == 4 and k == 4) { // Use specialized 4x4 kernel Kernel4x4.matmul( @ptrCast(*const [4][4]f32, a.data), @ptrCast(*const [4][4]f32, b.data), @ptrCast(*[4][4]f32, result.data), ); } else if (m == 128 and n == 128 and k == 128) { // Use specialized 128x128 kernel Kernel128x128.matmul( @ptrCast(*const [128][128]f32, a.data), @ptrCast(*const [128][128]f32, b.data), @ptrCast(*[128][128]f32, result.data), ); } else { // Use generic implementation for arbitrary dimensions try genericMatmul(a, b, &result); } return result; } // Apply compile-time metaprogramming to optimize data layouts pub fn optimizedTensorLayout(comptime T: type, comptime dims: []const usize) type { return struct { const Self = @This(); // Determine optimal memory layout at compile time const optimal_layout = optimizeMemoryLayout(T, dims); // Data storage with optimized layout data: [product(dims)]T align(optimal_layout.alignment), shape: [dims.len]usize, strides: [dims.len]usize, // Tensor initialization with optimal layout pub fn init(allocator: std.mem.Allocator) !Self { const data = try allocator.alignedAlloc( T, optimal_layout.alignment, product(dims), ); // Calculate optimal strides based on layout var strides: [dims.len]usize = undefined; if 
(optimal_layout.row_major) { // Row-major strides var stride: usize = 1; var i: usize = dims.len; while (i > 0) { i -= 1; strides[i] = stride; stride *= dims[i]; } } else { // Column-major strides var stride: usize = 1; for (0..dims.len) |i| { strides[i] = stride; stride *= dims[i]; } } return Self{ .data = data, .shape = dims, .strides = strides, }; } // Helper function to calculate optimal memory layout fn optimizeMemoryLayout(comptime T: type, comptime dims: []const usize) struct { row_major: bool, alignment: u29, } { // Use column-major for matrices where the first dimension is much larger // This often improves cache locality for common access patterns const row_major = if (dims.len == 2) dims[0] <= dims[1] * 2 else true; // Determine optimal alignment based on vector units const alignment = if (@sizeOf(T) == 4 and comptime std.Target.current.cpu.arch == .x86_64) if (comptime std.Target.current.cpu.features.isEnabled(.avx512f)) 64 // 512-bit alignment for AVX-512 else if (comptime std.Target.current.cpu.features.isEnabled(.avx2)) 32 // 256-bit alignment for AVX2 else if (comptime std.Target.current.cpu.features.isEnabled(.sse2)) 16 // 128-bit alignment for SSE2 else @alignOf(T) else @alignOf(T); return .{ .row_major = row_major, .alignment = alignment, }; } // Helper to calculate the product of dimensions fn product(comptime dims: []const usize) usize { var result: usize = 1; for (dims) |dim| { result *= dim; } return result; } }; } ``` **Key Compile-Time Techniques:** 1. **Matrix Operation Specialization** - Specialized kernels generated at compile-time for common dimensions - Full loop unrolling for small matrices - Compile-time configurable blocking strategies for cache optimization 2. **Data Layout Optimization** - Automatic selection of row-major or column-major layout based on dimensions - Optimal memory alignment for target architecture's vector units - Compile-time stride calculation for fast indexing 3. **Architecture-Specific Optimizations** - Vector width specialization based on target CPU features - Automatic use of FMA instructions when available - SIMD instruction generation tailored to the target architecture 4. 
**Kernel Selection** - Runtime dispatch to specialized kernels based on input dimensions - Fallback to generic implementation for arbitrary dimensions - Compile-time branch elimination for performance-critical paths #### 5.2 Quantization Framework Our quantization framework allows for efficient low-precision inference while maintaining accuracy: ```zig // Quantization configuration pub const QuantizationConfig = struct { // Precision of quantized values bits: u8 = 8, // Quantization scheme scheme: enum { symmetric, // Zero-point is always 0, simplifies arithmetic asymmetric, // Allows representing the full range more precisely } = .symmetric, // Quantization granularity granularity: enum { per_tensor, // One scale for the entire tensor per_channel, // Different scale for each output channel } = .per_tensor, // Whether to use integer or float16 quantization use_float16: bool = false, // Calibration strategy calibration: enum { minmax, // Simple min/max scaling entropy, // Entropy-based quantization percentile, // Clip to percentile range for outliers } = .minmax, // Percentile value for calibration (0.0-1.0) percentile: f32 = 0.99995, }; // Quantized tensor type that tracks quantization parameters pub fn QuantizedTensor(comptime original_type: type, comptime bits: u8) type { return struct { const Self = @This(); // Determine the appropriate integer type based on bit width const IntType = std.meta.Int(.unsigned, bits); // Original element type for reference pub const OriginalType = original_type; // Quantized data data: []IntType, // Original tensor shape shape: []const usize, // Quantization parameters scale: []f32, zero_point: []IntType, // Whether scale/zero_point are per-tensor or per-channel per_channel: bool, // For asymmetric quantization: minimum representable value qmin: IntType, // For asymmetric quantization: maximum representable value qmax: IntType, // Channel dimension for per-channel quantization channel_dim: ?usize, // Memory allocator for cleanup allocator: std.mem.Allocator, // Initialize a quantized tensor pub fn init( allocator: std.mem.Allocator, shape: []const usize, per_channel: bool, channel_dim: ?usize, ) !Self { // Calculate total size var total_size: usize = 1; for (shape) |dim| { total_size *= dim; } // Determine number of scales/zero_points needed const param_size = if (per_channel) shape[channel_dim.?] 
else 1; // Allocate memory const data = try allocator.alloc(IntType, total_size); errdefer allocator.free(data); const scale = try allocator.alloc(f32, param_size); errdefer allocator.free(scale); const zero_point = try allocator.alloc(IntType, param_size); errdefer allocator.free(zero_point); // Calculate quantization range const qmin: IntType = 0; const qmax: IntType = (1 << bits) - 1; // Create shape copy const shape_copy = try allocator.dupe(usize, shape); errdefer allocator.free(shape_copy); return Self{ .data = data, .shape = shape_copy, .scale = scale, .zero_point = zero_point, .per_channel = per_channel, .qmin = qmin, .qmax = qmax, .channel_dim = channel_dim, .allocator = allocator, }; } // Free allocated memory pub fn deinit(self: *Self) void { self.allocator.free(self.data); self.allocator.free(self.scale); self.allocator.free(self.zero_point); self.allocator.free(self.shape); } }; } // Quantize a floating-point tensor to integer precision pub fn quantize( tensor: anytype, config: QuantizationConfig, allocator: std.mem.Allocator, ) !QuantizedTensor( @TypeOf(tensor.data[0]), config.bits, ) { const T = @TypeOf(tensor.data[0]); // Validate input if (config.bits > 16) { return error.UnsupportedQuantizationBits; } if (config.granularity == .per_channel and config.calibration != .minmax) { return error.UnsupportedCombination; } // Create quantized tensor var channel_dim: ?usize = null; if (config.granularity == .per_channel) { // For per-channel quantization, use dimension 0 for vectors, // dimension 1 for matrices (assuming CHW layout) channel_dim = if (tensor.shape.len == 1) 0 else 1; } var qtensor = try QuantizedTensor(T, config.bits).init( allocator, tensor.shape, config.granularity == .per_channel, channel_dim, ); errdefer qtensor.deinit(); // Different calibration strategies switch (config.calibration) { .minmax => try calibrateMinMax(&qtensor, tensor, config), .entropy => try calibrateEntropy(&qtensor, tensor, config), .percentile => try calibratePercentile(&qtensor, tensor, config), } // Perform actual quantization try quantizeTensor(&qtensor, tensor, config); return qtensor; } // Dequantize a tensor back to floating point pub fn dequantize( qtensor: anytype, allocator: std.mem.Allocator, ) !Tensor(@TypeOf(qtensor).OriginalType, qtensor.shape.len) { const T = @TypeOf(qtensor).OriginalType; // Create tensor to hold dequantized values var tensor = try Tensor(T, qtensor.shape.len).init( allocator, qtensor.shape, ); errdefer tensor.deinit(); // Dequantize values if (qtensor.per_channel) { const channel_dim = qtensor.channel_dim.?; const channels = qtensor.shape[channel_dim]; // Calculate strides for traversing channels var strides: []usize = try allocator.alloc(usize, qtensor.shape.len); defer allocator.free(strides); var stride: usize = 1; var i: usize = qtensor.shape.len; while (i > 0) { i -= 1; strides[i] = stride; stride *= qtensor.shape[i]; } // Dequantize each element based on its channel for (0..tensor.data.len) |idx| { const channel_idx = (idx / strides[channel_dim]) % channels; const scale = qtensor.scale[channel_idx]; const zero_point = qtensor.zero_point[channel_idx]; tensor.data[idx] = @floatCast(T, @intToFloat(f32, qtensor.data[idx] - zero_point) * scale ); } } else { // Per-tensor dequantization (simpler) const scale = qtensor.scale[0]; const zero_point = qtensor.zero_point[0]; for (0..tensor.data.len) |i| { tensor.data[i] = @floatCast(T, @intToFloat(f32, qtensor.data[i] - zero_point) * scale ); } } return tensor; } // Calibrate using simple min/max strategy fn 
calibrateMinMax( qtensor: anytype, tensor: anytype, config: QuantizationConfig, ) !void { if (config.granularity == .per_tensor) { // Find min/max across entire tensor var min_val: f32 = std.math.inf_f32; var max_val: f32 = -std.math.inf_f32; for (tensor.data) |val| { const fval = @floatCast(f32, val); min_val = @min(min_val, fval); max_val = @max(max_val, fval); } // Handle symmetric quantization if (config.scheme == .symmetric) { const abs_max = @max(@abs(min_val), @abs(max_val)); min_val = -abs_max; max_val = abs_max; } // Calculate scale and zero_point const range = max_val - min_val; qtensor.scale[0] = range / @intToFloat(f32, qtensor.qmax - qtensor.qmin); if (config.scheme == .symmetric) { qtensor.zero_point[0] = @divFloor(qtensor.qmax - qtensor.qmin, 2) + qtensor.qmin; } else { qtensor.zero_point[0] = @floatToInt( @TypeOf(qtensor.zero_point[0]), @round(qtensor.qmin - min_val / qtensor.scale[0]) ); } } else { // Per-channel quantization // ... implementation details ... } } // Perform actual quantization fn quantizeTensor( qtensor: anytype, tensor: anytype, config: QuantizationConfig, ) !void { if (qtensor.per_channel) { // Per-channel quantization // ... implementation details ... } else { // Per-tensor quantization const scale = qtensor.scale[0]; const zero_point = qtensor.zero_point[0]; const qmin = qtensor.qmin; const qmax = qtensor.qmax; for (0..tensor.data.len) |i| { const val = @floatCast(f32, tensor.data[i]); // Quantize: x_q = round(x / scale) + zero_point var q_val = @floatToInt( @TypeOf(qtensor.data[0]), @round(val / scale) + @intToFloat(f32, zero_point) ); // Clamp to quantization range q_val = @max(@min(q_val, qmax), qmin); qtensor.data[i] = q_val; } } } ``` **Quantization Features:** 1. **Multiple Precision Options** - 8-bit quantization for maximum throughput - 4-bit quantization for model compression - 3-bit quantization for extreme size reduction - FP16 quantization for memory bandwidth reduction with minimal accuracy loss 2. **Flexible Quantization Schemes** - Symmetric quantization for simpler arithmetic - Asymmetric quantization for better range utilization - Per-tensor quantization for speed - Per-channel quantization for accuracy 3. **Advanced Calibration Methods** - Min/max calibration for simplicity - Entropy-based calibration for better distribution representation - Percentile-based calibration for outlier handling 4. **Mixed-Precision Execution** - Critical layers in higher precision for accuracy - Non-critical layers in lower precision for speed - Automatic precision selection based on sensitivity analysis 5. **Hardware Acceleration** - Optimized integer SIMD operations for quantized execution - Specialized kernels for common quantized operations - Hardware-specific optimizations for quantized compute ## Platform-Specific Optimizations ### Apple Silicon (M-Series) The DeepSeek V3 Zig implementation is highly optimized for Apple Silicon's unique architecture: 1. **Metal Performance Shaders (MPS) Integration** - Direct integration with Apple's Metal Performance Shaders for matrix operations - Custom Metal compute kernels optimized for M-series chips - Efficient memory sharing between CPU and GPU with zero-copy transfers 2. **Tensor Core Utilization** - Leveraging Matrix multiplication units in M-series chips - Mixed-precision operations optimized for Apple Silicon - Native FP16 support for improved throughput 3. 
## Platform-Specific Optimizations

### Apple Silicon (M-Series)

The DeepSeek V3 Zig implementation is highly optimized for Apple Silicon's unique architecture:

1. **Metal Performance Shaders (MPS) Integration**
   - Direct integration with Apple's Metal Performance Shaders for matrix operations
   - Custom Metal compute kernels optimized for M-series chips
   - Efficient memory sharing between CPU and GPU with zero-copy transfers

2. **Matrix Multiplication Unit Utilization**
   - Leveraging the matrix multiplication units in M-series chips
   - Mixed-precision operations optimized for Apple Silicon
   - Native FP16 support for improved throughput

3. **AMX Instruction Set Access**
   - Direct use of Apple's AMX matrix extensions for accelerated linear algebra
   - Low-level optimization of critical matrix operations
   - Custom assembly routines for maximum performance

4. **Memory Bandwidth Optimization**
   - Unified memory architecture exploitation
   - Cache-friendly memory access patterns
   - Optimal tile sizes for the M-series cache hierarchy

5. **Power Efficiency Tuning**
   - Dynamic performance/power scaling
   - Efficient core utilization across P and E cores
   - Background inference optimizations

### x86_64 Architecture

For x86_64 platforms, our implementation focuses on leveraging the latest instruction sets:

1. **AVX-512 Vectorization**
   - Full utilization of 512-bit vector operations
   - Masked operations for efficient boundary handling
   - FMA instruction usage for maximum throughput

2. **Cache-Friendly Memory Layouts**
   - Cache line aligned data structures
   - Blocked algorithms optimized for typical L1/L2/L3 cache sizes
   - Software prefetching for critical data paths

3. **Thread Pool Optimization**
   - Work-stealing scheduler for balanced multicore utilization
   - NUMA-aware memory allocation and thread assignment
   - Adaptive parallelism based on available cores

4. **Dynamic Dispatch**
   - Runtime CPU feature detection
   - Specialized code paths for different instruction sets
   - Fallback implementations for compatibility (see the sketch after this list)
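
Both platform sections above ultimately come down to selecting the right kernel for the target. The minimal sketch below shows how the target information exposed through `@import("builtin")` and `std.Target` can pick a default kernel per architecture; the enum, names, and backend split are illustrative assumptions rather than the proposal's final API, and a shipping build would layer runtime CPUID-based dispatch on top of this compile-time choice for portable binaries.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Hypothetical kernel variants; names are illustrative only.
const MatmulKernel = enum { apple_amx_or_metal, avx512, avx2, scalar };

// Choose a default kernel for the compilation target at comptime.
pub fn defaultKernel() MatmulKernel {
    return comptime switch (builtin.cpu.arch) {
        .aarch64 => if (builtin.os.tag == .macos)
            // Route matmuls to the Accelerate/Metal-backed paths on Apple Silicon
            .apple_amx_or_metal
        else
            .scalar,
        .x86_64 => if (std.Target.x86.featureSetHas(builtin.cpu.features, .avx512f))
            .avx512
        else if (std.Target.x86.featureSetHas(builtin.cpu.features, .avx2))
            .avx2
        else
            .scalar,
        else => .scalar,
    };
}

test "selected kernel is known at compile time" {
    const k = comptime defaultKernel();
    std.debug.print("kernel: {s}\n", .{@tagName(k)});
}
```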
### NVIDIA GPUs

NVIDIA GPU acceleration is implemented through an efficient CUDA integration:

1. **CUDA Integration via FFI**
   - Zero-overhead bindings to the CUDA runtime (sketched after this list)
   - Asynchronous kernel execution and memory transfers
   - Efficient stream management for overlapping operations

2. **Custom CUDA Kernels**
   - Specialized kernels for attention mechanisms
   - Optimized matrix multiplication for transformer layers
   - Fused operations for reduced kernel launch overhead

3. **Memory Management**
   - Pinned memory for efficient transfers
   - Memory pool for reduced allocation overhead
   - Smart prefetching for predictable memory access patterns

4. **Tensor Core Utilization**
   - Mixed-precision operations using Tensor Cores
   - Automatic kernel selection for Tensor Core-eligible operations
   - Tensor Core-compatible memory layouts
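
As a rough illustration of the FFI layer, the sketch below binds a few CUDA runtime calls directly from Zig. The extern declarations mirror the public CUDA runtime C API (`cudaMalloc`, `cudaMemcpy`, `cudaFree`); the Zig-side naming, error handling, and the absence of streams and kernel launches are simplifications for illustration, not the proposed backend design.

```zig
const std = @import("std");

// Minimal extern declarations against the CUDA runtime (link with -lcudart).
// Only the calls used in this sketch are declared.
const cudaError_t = c_int;
const cudaMemcpyKind = c_int;
const cudaMemcpyHostToDevice: cudaMemcpyKind = 1;
const cudaMemcpyDeviceToHost: cudaMemcpyKind = 2;

extern "c" fn cudaMalloc(dev_ptr: *?*anyopaque, size: usize) cudaError_t;
extern "c" fn cudaFree(dev_ptr: ?*anyopaque) cudaError_t;
extern "c" fn cudaMemcpy(
    dst: ?*anyopaque,
    src: ?*const anyopaque,
    count: usize,
    kind: cudaMemcpyKind,
) cudaError_t;

fn check(code: cudaError_t) !void {
    if (code != 0) return error.CudaError; // 0 == cudaSuccess
}

// Copy host data to the device and back. A real backend would launch custom
// kernels in between and use streams with asynchronous copies for overlap.
pub fn roundTrip(host: []f32) !void {
    if (host.len == 0) return;
    const bytes = host.len * @sizeOf(f32);

    var dev: ?*anyopaque = null;
    try check(cudaMalloc(&dev, bytes));
    defer _ = cudaFree(dev);

    try check(cudaMemcpy(dev, &host[0], bytes, cudaMemcpyHostToDevice));
    // ... attention / matmul / fused kernels would run here ...
    try check(cudaMemcpy(&host[0], dev, bytes, cudaMemcpyDeviceToHost));
}
```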
## Development Roadmap

### Phase 1: Core Infrastructure

The initial phase focuses on establishing the foundational components:

- **Memory Management System**
  - Custom tensor allocator implementation
  - Arena-based allocation strategies
  - Error handling framework
- **Tensor Implementation**
  - Basic tensor operations and utilities
  - SIMD-accelerated implementations
  - Platform detection and optimization
- **Computation Backend Interfaces**
  - Abstract backend interfaces
  - CPU backend implementation
  - Initial Metal backend for Apple Silicon
- **Error Handling Framework**
  - Robust error propagation
  - Detailed error reporting
  - Resource cleanup guarantees

### Phase 2: Model Architecture

Building on the infrastructure, we implement the core model components:

- **Transformer Layers**
  - Multi-head attention implementation
  - Feed-forward networks
  - Layer normalization
- **Attention Mechanisms**
  - Standard attention implementation
  - Flash attention optimizations
  - Memory-efficient attention variants
- **Mixture of Experts**
  - Router implementation
  - Parallel expert execution
  - Load balancing mechanisms
- **Embedding Systems**
  - Token embeddings
  - Position embeddings
  - Rotary position embeddings

### Phase 3: Backend Integration

This phase extends compute capabilities across different hardware:

- **CPU Backend**
  - AVX-512 optimizations
  - Thread pool implementation
  - Cache-optimized algorithms
- **Metal Backend**
  - Complete Metal shader library
  - Apple Neural Engine integration
  - M-series specific optimizations
- **CUDA Backend**
  - NVIDIA GPU support
  - Tensor Core optimizations
  - Multi-GPU scaling
- **Vulkan Backend**
  - Cross-platform GPU support
  - AMD GPU optimizations
  - Intel GPU support

### Phase 4: Inference Pipeline

Creating the end-to-end inference system:

- **Model Loading**
  - SafeTensors format support
  - Checkpoint loading
  - Weight quantization
- **Tokenization**
  - Efficient tokenizer implementation
  - Streaming tokenization
  - Special token handling
- **Generation Strategies**
  - Sampling methods implementation
  - Beam search
  - Speculative decoding
- **Output Processing**
  - Token streaming
  - Stop sequence handling
  - Result formatting

### Phase 5: Optimization

Comprehensive optimization across the entire stack:

- **Compile-Time Optimizations**
  - Comptime specialization
  - Kernel generation
  - Custom data layouts
- **Runtime Optimizations**
  - Dynamic kernel selection
  - Adaptive compute strategies
  - Memory access optimizations
- **Architecture-Specific Tuning**
  - Platform-specific parameter tuning
  - Hardware-specific kernel variants
  - Feature detection and adaptation
- **Quantization Framework**
  - 8-bit quantization
  - 4-bit quantization
  - Mixed-precision execution

### Phase 6: Testing and Benchmarking

Ensuring correctness and measuring performance:

- **Comprehensive Test Suite**
  - Unit tests for all components
  - Integration tests for end-to-end validation
  - Conformance tests against the reference implementation
- **Benchmarking Framework**
  - Performance measurement tools
  - Comparison with the PyTorch implementation
  - Memory usage analysis
- **Platform Benchmarks**
  - Apple Silicon performance
  - x86_64 performance
  - NVIDIA GPU performance
- **Fine-Tuning**
  - Performance bottleneck identification
  - Targeted optimizations
  - Final parameter tuning

## Why Propose DeepSeek V3 in Zig?

The proposed migration of DeepSeek V3 to Zig represents a significant advance in language model implementation. By leveraging Zig's unique features, particularly compile-time metaprogramming and fine-grained memory control, we aim to create a highly optimized implementation that substantially outperforms the original Python/PyTorch version while maintaining flexibility and ease of use.

Key advantages of the Zig implementation include:

1. **Superior Performance**
   - Compile-time specialization eliminates runtime overhead
   - Direct hardware access for maximum efficiency
   - Zero-cost abstractions for clean yet fast code

2. **Memory Efficiency**
   - Explicit allocation strategies tailored to LLM workloads
   - Reduced memory fragmentation
   - Lower overall memory footprint

3. **Reliability**
   - Comprehensive error handling
   - No runtime exceptions
   - Deterministic resource cleanup

4. **Portability**
   - Cross-platform support with optimized backends
   - Consistent behavior across environments
   - Single codebase for all target platforms

The resulting system should be particularly well suited to deployment on resource-constrained devices and is expected to deliver strong performance across all supported platforms. This architectural approach sets the foundation for future innovations in large language model deployment.