diff --git a/README-DEEPSEEK_LEGACY.md b/README-DEEPSEEK_LEGACY.md new file mode 100644 index 0000000..7ac034c --- /dev/null +++ b/README-DEEPSEEK_LEGACY.md @@ -0,0 +1,361 @@ + + + + +
+ DeepSeek-V3 +
+
+
+ Homepage + Chat + Hugging Face +
+ Discord + Wechat + Twitter Follow +
+ Code License + Model License +
+ Paper Link👁️ +
+ +## Table of Contents + +1. [Introduction](#1-introduction) +2. [Model Summary](#2-model-summary) +3. [Model Downloads](#3-model-downloads) +4. [Evaluation Results](#4-evaluation-results) +5. [Chat Website & API Platform](#5-chat-website--api-platform) +6. [How to Run Locally](#6-how-to-run-locally) +7. [License](#7-license) +8. [Citation](#8-citation) +9. [Contact](#9-contact) + + +## 1. Introduction + +We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. +To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. +Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. +We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. +Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. +Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. +In addition, its training process is remarkably stable. +Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. +

+ +

+ +## 2. Model Summary + +--- + +**Architecture: Innovative Load Balancing Strategy and Training Objective** + +- On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing. +- We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. + It can also be used for speculative decoding for inference acceleration. + +--- + +**Pre-Training: Towards Ultimate Training Efficiency** + +- We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model. +- Through co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, nearly achieving full computation-communication overlap. + This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead. +- At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours. + +--- + +**Post-Training: Knowledge Distillation from DeepSeek-R1** + +- We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain a control over the output style and length of DeepSeek-V3. + +--- + + +## 3. Model Downloads + +
+ +| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download** | +| :------------: | :------------: | :------------: | :------------: | :------------: | +| DeepSeek-V3-Base | 671B | 37B | 128K | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base) | +| DeepSeek-V3 | 671B | 37B | 128K | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V3) | + +
+ +> [!NOTE] +> The total size of DeepSeek-V3 models on Hugging Face is 685B, which includes 671B of the Main Model weights and 14B of the Multi-Token Prediction (MTP) Module weights. + +To ensure optimal performance and flexibility, we have partnered with open-source communities and hardware vendors to provide multiple ways to run the model locally. For step-by-step guidance, check out Section 6: [How_to Run_Locally](#6-how-to-run-locally). + +For developers looking to dive deeper, we recommend exploring [README_WEIGHTS.md](./README_WEIGHTS.md) for details on the Main Model weights and the Multi-Token Prediction (MTP) Modules. Please note that MTP support is currently under active development within the community, and we welcome your contributions and feedback. + +## 4. Evaluation Results +### Base Model +#### Standard Benchmarks + +
+ + +| | Benchmark (Metric) | # Shots | DeepSeek-V2 | Qwen2.5 72B | LLaMA3.1 405B | DeepSeek-V3 | +|---|-------------------|----------|--------|-------------|---------------|---------| +| | Architecture | - | MoE | Dense | Dense | MoE | +| | # Activated Params | - | 21B | 72B | 405B | 37B | +| | # Total Params | - | 236B | 72B | 405B | 671B | +| English | Pile-test (BPB) | - | 0.606 | 0.638 | **0.542** | 0.548 | +| | BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | **87.5** | +| | MMLU (Acc.) | 5-shot | 78.4 | 85.0 | 84.4 | **87.1** | +| | MMLU-Redux (Acc.) | 5-shot | 75.6 | 83.2 | 81.3 | **86.2** | +| | MMLU-Pro (Acc.) | 5-shot | 51.4 | 58.3 | 52.8 | **64.4** | +| | DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | **89.0** | +| | ARC-Easy (Acc.) | 25-shot | 97.6 | 98.4 | 98.4 | **98.9** | +| | ARC-Challenge (Acc.) | 25-shot | 92.2 | 94.5 | **95.3** | **95.3** | +| | HellaSwag (Acc.) | 10-shot | 87.1 | 84.8 | **89.2** | 88.9 | +| | PIQA (Acc.) | 0-shot | 83.9 | 82.6 | **85.9** | 84.7 | +| | WinoGrande (Acc.) | 5-shot | **86.3** | 82.3 | 85.2 | 84.9 | +| | RACE-Middle (Acc.) | 5-shot | 73.1 | 68.1 | **74.2** | 67.1 | +| | RACE-High (Acc.) | 5-shot | 52.6 | 50.3 | **56.8** | 51.3 | +| | TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | **82.9** | +| | NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | **41.5** | 40.0 | +| | AGIEval (Acc.) | 0-shot | 57.5 | 75.8 | 60.6 | **79.6** | +| Code | HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | **65.2** | +| | MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | **75.4** | +| | LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | **19.4** | +| | CRUXEval-I (Acc.) | 2-shot | 52.5 | 59.1 | 58.5 | **67.3** | +| | CRUXEval-O (Acc.) | 2-shot | 49.8 | 59.9 | 59.9 | **69.8** | +| Math | GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | **89.3** | +| | MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | **61.6** | +| | MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | **79.8** | +| | CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | **90.7** | +| Chinese | CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | **83.0** | 82.7 | +| | C-Eval (Acc.) | 5-shot | 81.4 | 89.2 | 72.5 | **90.1** | +| | CMMLU (Acc.) | 5-shot | 84.0 | **89.5** | 73.7 | 88.8 | +| | CMRC (EM) | 1-shot | **77.4** | 75.8 | 76.0 | 76.3 | +| | C3 (Acc.) | 0-shot | 77.4 | 76.7 | **79.7** | 78.6 | +| | CCPM (Acc.) | 0-shot | **93.0** | 88.5 | 78.6 | 92.0 | +| Multilingual | MMMLU-non-English (Acc.) | 5-shot | 64.0 | 74.8 | 73.8 | **79.4** | + +
+ +> [!NOTE] +> Best results are shown in bold. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3 achieves the best performance on most benchmarks, especially on math and code tasks. +> For more evaluation details, please check our paper. + +#### Context Window +

+ +

+ +Evaluation results on the ``Needle In A Haystack`` (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to **128K**. + +### Chat Model +#### Standard Benchmarks (Models larger than 67B) +
+ +| | **Benchmark (Metric)** | **DeepSeek V2-0506** | **DeepSeek V2.5-0905** | **Qwen2.5 72B-Inst.** | **Llama3.1 405B-Inst.** | **Claude-3.5-Sonnet-1022** | **GPT-4o 0513** | **DeepSeek V3** | +|---|---------------------|---------------------|----------------------|---------------------|----------------------|---------------------------|----------------|----------------| +| | Architecture | MoE | MoE | Dense | Dense | - | - | MoE | +| | # Activated Params | 21B | 21B | 72B | 405B | - | - | 37B | +| | # Total Params | 236B | 236B | 72B | 405B | - | - | 671B | +| English | MMLU (EM) | 78.2 | 80.6 | 85.3 | **88.6** | **88.3** | 87.2 | **88.5** | +| | MMLU-Redux (EM) | 77.9 | 80.3 | 85.6 | 86.2 | **88.9** | 88.0 | **89.1** | +| | MMLU-Pro (EM) | 58.5 | 66.2 | 71.6 | 73.3 | **78.0** | 72.6 | 75.9 | +| | DROP (3-shot F1) | 83.0 | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 | **91.6** | +| | IF-Eval (Prompt Strict) | 57.7 | 80.6 | 84.1 | 86.0 | **86.5** | 84.3 | 86.1 | +| | GPQA-Diamond (Pass@1) | 35.3 | 41.3 | 49.0 | 51.1 | **65.0** | 49.9 | 59.1 | +| | SimpleQA (Correct) | 9.0 | 10.2 | 9.1 | 17.1 | 28.4 | **38.2** | 24.9 | +| | FRAMES (Acc.) | 66.9 | 65.4 | 69.8 | 70.0 | 72.5 | **80.5** | 73.3 | +| | LongBench v2 (Acc.) | 31.6 | 35.4 | 39.4 | 36.1 | 41.0 | 48.1 | **48.7** | +| Code | HumanEval-Mul (Pass@1) | 69.3 | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 | **82.6** | +| | LiveCodeBench (Pass@1-COT) | 18.8 | 29.2 | 31.1 | 28.4 | 36.3 | 33.4 | **40.5** | +| | LiveCodeBench (Pass@1) | 20.3 | 28.4 | 28.7 | 30.1 | 32.8 | 34.2 | **37.6** | +| | Codeforces (Percentile) | 17.5 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 | **51.6** | +| | SWE Verified (Resolved) | - | 22.6 | 23.8 | 24.5 | **50.8** | 38.8 | 42.0 | +| | Aider-Edit (Acc.) | 60.3 | 71.6 | 65.4 | 63.9 | **84.2** | 72.9 | 79.7 | +| | Aider-Polyglot (Acc.) | - | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 | **49.6** | +| Math | AIME 2024 (Pass@1) | 4.6 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | **39.2** | +| | MATH-500 (EM) | 56.3 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | **90.2** | +| | CNMO 2024 (Pass@1) | 2.8 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 | **43.2** | +| Chinese | CLUEWSC (EM) | 89.9 | 90.4 | **91.4** | 84.7 | 85.4 | 87.9 | 90.9 | +| | C-Eval (EM) | 78.6 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 | **86.5** | +| | C-SimpleQA (Correct) | 48.5 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 | **64.8** | + +
+ +> [!NOTE] +> All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models. + + +#### Open Ended Generation Evaluation + +
+ + + +| Model | Arena-Hard | AlpacaEval 2.0 | +|-------|------------|----------------| +| DeepSeek-V2.5-0905 | 76.2 | 50.5 | +| Qwen2.5-72B-Instruct | 81.2 | 49.1 | +| LLaMA-3.1 405B | 69.3 | 40.5 | +| GPT-4o-0513 | 80.4 | 51.1 | +| Claude-Sonnet-3.5-1022 | 85.2 | 52.0 | +| DeepSeek-V3 | **85.5** | **70.0** | + +
+ +> [!NOTE] +> English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric. + + +## 5. Chat Website & API Platform +You can chat with DeepSeek-V3 on DeepSeek's official website: [chat.deepseek.com](https://chat.deepseek.com/sign_in) + +We also provide OpenAI-Compatible API at DeepSeek Platform: [platform.deepseek.com](https://platform.deepseek.com/) + +## 6. How to Run Locally + +DeepSeek-V3 can be deployed locally using the following hardware and open-source community software: + +1. **DeepSeek-Infer Demo**: We provide a simple and lightweight demo for FP8 and BF16 inference. +2. **SGLang**: Fully support the DeepSeek-V3 model in both BF16 and FP8 inference modes, with Multi-Token Prediction [coming soon](https://github.com/sgl-project/sglang/issues/2591). +3. **LMDeploy**: Enables efficient FP8 and BF16 inference for local and cloud deployment. +4. **TensorRT-LLM**: Currently supports BF16 inference and INT4/8 quantization, with FP8 support coming soon. +5. **vLLM**: Support DeepSeek-V3 model with FP8 and BF16 modes for tensor parallelism and pipeline parallelism. +6. **LightLLM**: Supports efficient single-node or multi-node deployment for FP8 and BF16. +7. **AMD GPU**: Enables running the DeepSeek-V3 model on AMD GPUs via SGLang in both BF16 and FP8 modes. +8. **Huawei Ascend NPU**: Supports running DeepSeek-V3 on Huawei Ascend devices. + +Since FP8 training is natively adopted in our framework, we only provide FP8 weights. If you require BF16 weights for experimentation, you can use the provided conversion script to perform the transformation. + +Here is an example of converting FP8 weights to BF16: + +```shell +cd inference +python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights +``` + +> [!NOTE] +> Hugging Face's Transformers has not been directly supported yet. + +### 6.1 Inference with DeepSeek-Infer Demo (example only) + +#### System Requirements + +> [!NOTE] +> Linux with Python 3.10 only. Mac and Windows are not supported. + +Dependencies: +```pip-requirements +torch==2.4.1 +triton==3.0.0 +transformers==4.46.3 +safetensors==0.4.5 +``` +#### Model Weights & Demo Code Preparation + +First, clone our DeepSeek-V3 GitHub repository: + +```shell +git clone https://github.com/deepseek-ai/DeepSeek-V3.git +``` + +Navigate to the `inference` folder and install dependencies listed in `requirements.txt`. Easiest way is to use a package manager like `conda` or `uv` to create a new virtual environment and install the dependencies. + +```shell +cd DeepSeek-V3/inference +pip install -r requirements.txt +``` + +Download the model weights from Hugging Face, and put them into `/path/to/DeepSeek-V3` folder. 
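+
+A minimal sketch of one way to pull the weights with the Hugging Face CLI (the `huggingface-cli` tool and the exact target directory are assumptions here; adapt them to your environment):
+
+```shell
+pip install "huggingface_hub[cli]"
+huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir /path/to/DeepSeek-V3
+```
+
+Downloading `deepseek-ai/DeepSeek-V3-Base` works the same way; only the repository name changes.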
+ +#### Model Weights Conversion + +Convert Hugging Face model weights to a specific format: + +```shell +python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16 +``` + +#### Run + +Then you can chat with DeepSeek-V3: + +```shell +torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200 +``` + +Or batch inference on a given file: + +```shell +torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --input-file $FILE +``` + +### 6.2 Inference with SGLang (recommended) + +[SGLang](https://github.com/sgl-project/sglang) currently supports [MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations), [DP Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models), FP8 (W8A8), FP8 KV Cache, and Torch Compile, delivering state-of-the-art latency and throughput performance among open-source frameworks. + +Notably, [SGLang v0.4.1](https://github.com/sgl-project/sglang/releases/tag/v0.4.1) fully supports running DeepSeek-V3 on both **NVIDIA and AMD GPUs**, making it a highly versatile and robust solution. + +SGLang also supports [multi-node tensor parallelism](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208), enabling you to run this model on multiple network-connected machines. + +Multi-Token Prediction (MTP) is in development, and progress can be tracked in the [optimization plan](https://github.com/sgl-project/sglang/issues/2591). + +Here are the launch instructions from the SGLang team: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3 + +### 6.3 Inference with LMDeploy (recommended) +[LMDeploy](https://github.com/InternLM/lmdeploy), a flexible and high-performance inference and serving framework tailored for large language models, now supports DeepSeek-V3. It offers both offline pipeline processing and online deployment capabilities, seamlessly integrating with PyTorch-based workflows. + +For comprehensive step-by-step instructions on running DeepSeek-V3 with LMDeploy, please refer to here: https://github.com/InternLM/lmdeploy/issues/2960 + + +### 6.4 Inference with TRT-LLM (recommended) + +[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) now supports the DeepSeek-V3 model, offering precision options such as BF16 and INT4/INT8 weight-only. Support for FP8 is currently in progress and will be released soon. You can access the custom branch of TRTLLM specifically for DeepSeek-V3 support through the following link to experience the new features directly: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/deepseek_v3. + + +### 6.5 Inference with vLLM (recommended) + +[vLLM](https://github.com/vllm-project/vllm) v0.6.6 supports DeepSeek-V3 inference for FP8 and BF16 modes on both NVIDIA and AMD GPUs. Aside from standard techniques, vLLM offers _pipeline parallelism_ allowing you to run this model on multiple machines connected by networks. For detailed guidance, please refer to the [vLLM instructions](https://docs.vllm.ai/en/latest/serving/distributed_serving.html). 
Please feel free to follow [the enhancement plan](https://github.com/vllm-project/vllm/issues/11539) as well. + +### 6.6 Inference with LightLLM (recommended) + +[LightLLM](https://github.com/ModelTC/lightllm/tree/main) v1.0.1 supports single-machine and multi-machine tensor parallel deployment for DeepSeek-R1 (FP8/BF16) and provides mixed-precision deployment, with more quantization modes continuously integrated. For more details, please refer to [LightLLM instructions](https://lightllm-en.readthedocs.io/en/latest/getting_started/quickstart.html). Additionally, LightLLM offers PD-disaggregation deployment for DeepSeek-V2, and the implementation of PD-disaggregation for DeepSeek-V3 is in development. + +### 6.7 Recommended Inference Functionality with AMD GPUs + +In collaboration with the AMD team, we have achieved Day-One support for AMD GPUs using SGLang, with full compatibility for both FP8 and BF16 precision. For detailed guidance, please refer to the [SGLang instructions](#63-inference-with-lmdeploy-recommended). + +### 6.8 Recommended Inference Functionality with Huawei Ascend NPUs +The [MindIE](https://www.hiascend.com/en/software/mindie) framework from the Huawei Ascend community has successfully adapted the BF16 version of DeepSeek-V3. For step-by-step guidance on Ascend NPUs, please follow the [instructions here](https://modelers.cn/models/MindIE/deepseekv3). + + +## 7. License +This code repository is licensed under [the MIT License](LICENSE-CODE). The use of DeepSeek-V3 Base/Chat models is subject to [the Model License](LICENSE-MODEL). DeepSeek-V3 series (including Base and Chat) supports commercial use. + +## 8. Citation +``` +@misc{deepseekai2024deepseekv3technicalreport, + title={DeepSeek-V3 Technical Report}, + author={DeepSeek-AI}, + year={2024}, + eprint={2412.19437}, + archivePrefix={arXiv}, + primaryClass={cs.CL}, + url={https://arxiv.org/abs/2412.19437}, +} +``` + +## 9. Contact +If you have any questions, please raise an issue or contact us at [service@deepseek.com](service@deepseek.com). diff --git a/README.md b/README.md index 7ac034c..7a0b5e6 100644 --- a/README.md +++ b/README.md @@ -1,361 +1,4615 @@ - - - -
- DeepSeek-V3 + DeepSeek V3 in Zig

-
- Homepage - Chat - Hugging Face +
+ Language: Zig + License: DeepSeek + Status: Proposal
- Discord - Wechat - Twitter Follow + Performance: High Efficiency + Platform: Cross Platform
- Code License - Model License -
- Paper Link👁️ + Feature: SIMD Optimized + Architecture: MoE + Backend: Customizable
+
+ +

DeepSeek V3 in Zig: `DeepZig V3` Proposal

+ +## Overview + +This document outlines the architecture for implementing DeepSeek V3 in the Zig programming language. The focus is on leveraging Zig's unique features to create a high-performance, memory-efficient, and robust implementation of the DeepSeek V3 architecture. + +1. **Superior Performance**: Leverage Zig's compile-time metaprogramming, SIMD vectorization, and low-level control to achieve optimal performance across platforms +2. **Memory Efficiency**: Utilize Zig's explicit allocator system and arena allocation patterns for precise resource management +3. **Concurrent Processing**: Implement efficient parallel execution using Zig's advanced async/await framework and evented I/O +4. **Type Safety & Reliability**: Employ Zig's strong type system, comptime checks, and explicit error handling to prevent runtime errors +5. **Cross-Platform Support**: Create a portable implementation with seamless support across architectures (x86_64, ARM64, etc.) ## Table of Contents +1. [Overview](#overview) +2. [System Architecture](#system-architecture) +3. [Component Design](#component-design) + 1. [Core System](#core-system) + 1. [Memory Management System](#memory-management-system) + 2. [Tensor Implementation](#tensor-implementation) + 3. [Error Handling Framework](#error-handling-framework) + 4. [Concurrency Model](#concurrency-model) + 2. [Model Architecture](#model-architecture) + 1. [Transformer Core](#transformer-core) + 2. [Attention Mechanisms](#attention-mechanisms) + 3. [Mixture of Experts (MoE)](#mixture-of-experts-moe) + 3. [Computation Backend](#computation-backend) + 1. [Backend Interface](#backend-interface) + 2. [Metal Integration for Apple Silicon](#metal-integration-for-apple-silicon) + 4. [Inference Pipeline](#inference-pipeline) + 1. [Model Loading](#model-loading) + 2. [Generation Strategies](#generation-strategies) + 5. [Optimization Layer](#optimization-layer) + 1. [Compile-Time Optimizations](#compile-time-optimizations) + 2. [Quantization Framework](#quantization-framework) +4. [Platform-Specific Optimizations](#platform-specific-optimizations) +5. [Development Roadmap](#development-roadmap) +6. [Why Propose DeepSeek V3 in Zig?](#why-propose-deepseek-v3-in-zig) -1. [Introduction](#1-introduction) -2. [Model Summary](#2-model-summary) -3. [Model Downloads](#3-model-downloads) -4. [Evaluation Results](#4-evaluation-results) -5. [Chat Website & API Platform](#5-chat-website--api-platform) -6. [How to Run Locally](#6-how-to-run-locally) -7. [License](#7-license) -8. [Citation](#8-citation) -9. [Contact](#9-contact) +## System Architecture +### High-Level Component Overview -## 1. Introduction +The DeepSeek V3 Zig implementation consists of the following major components: -We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. -To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. -Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. -We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. 
-Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. -Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. -In addition, its training process is remarkably stable. -Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. -

- -

- -## 2. Model Summary - ---- - -**Architecture: Innovative Load Balancing Strategy and Training Objective** - -- On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing. -- We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. - It can also be used for speculative decoding for inference acceleration. - ---- - -**Pre-Training: Towards Ultimate Training Efficiency** - -- We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model. -- Through co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, nearly achieving full computation-communication overlap. - This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead. -- At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours. - ---- - -**Post-Training: Knowledge Distillation from DeepSeek-R1** - -- We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain a control over the output style and length of DeepSeek-V3. - ---- - - -## 3. Model Downloads - -
- -| **Model** | **#Total Params** | **#Activated Params** | **Context Length** | **Download** | -| :------------: | :------------: | :------------: | :------------: | :------------: | -| DeepSeek-V3-Base | 671B | 37B | 128K | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base) | -| DeepSeek-V3 | 671B | 37B | 128K | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V3) | - -
- -> [!NOTE] -> The total size of DeepSeek-V3 models on Hugging Face is 685B, which includes 671B of the Main Model weights and 14B of the Multi-Token Prediction (MTP) Module weights. - -To ensure optimal performance and flexibility, we have partnered with open-source communities and hardware vendors to provide multiple ways to run the model locally. For step-by-step guidance, check out Section 6: [How_to Run_Locally](#6-how-to-run-locally). - -For developers looking to dive deeper, we recommend exploring [README_WEIGHTS.md](./README_WEIGHTS.md) for details on the Main Model weights and the Multi-Token Prediction (MTP) Modules. Please note that MTP support is currently under active development within the community, and we welcome your contributions and feedback. - -## 4. Evaluation Results -### Base Model -#### Standard Benchmarks - -
- - -| | Benchmark (Metric) | # Shots | DeepSeek-V2 | Qwen2.5 72B | LLaMA3.1 405B | DeepSeek-V3 | -|---|-------------------|----------|--------|-------------|---------------|---------| -| | Architecture | - | MoE | Dense | Dense | MoE | -| | # Activated Params | - | 21B | 72B | 405B | 37B | -| | # Total Params | - | 236B | 72B | 405B | 671B | -| English | Pile-test (BPB) | - | 0.606 | 0.638 | **0.542** | 0.548 | -| | BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | **87.5** | -| | MMLU (Acc.) | 5-shot | 78.4 | 85.0 | 84.4 | **87.1** | -| | MMLU-Redux (Acc.) | 5-shot | 75.6 | 83.2 | 81.3 | **86.2** | -| | MMLU-Pro (Acc.) | 5-shot | 51.4 | 58.3 | 52.8 | **64.4** | -| | DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | **89.0** | -| | ARC-Easy (Acc.) | 25-shot | 97.6 | 98.4 | 98.4 | **98.9** | -| | ARC-Challenge (Acc.) | 25-shot | 92.2 | 94.5 | **95.3** | **95.3** | -| | HellaSwag (Acc.) | 10-shot | 87.1 | 84.8 | **89.2** | 88.9 | -| | PIQA (Acc.) | 0-shot | 83.9 | 82.6 | **85.9** | 84.7 | -| | WinoGrande (Acc.) | 5-shot | **86.3** | 82.3 | 85.2 | 84.9 | -| | RACE-Middle (Acc.) | 5-shot | 73.1 | 68.1 | **74.2** | 67.1 | -| | RACE-High (Acc.) | 5-shot | 52.6 | 50.3 | **56.8** | 51.3 | -| | TriviaQA (EM) | 5-shot | 80.0 | 71.9 | 82.7 | **82.9** | -| | NaturalQuestions (EM) | 5-shot | 38.6 | 33.2 | **41.5** | 40.0 | -| | AGIEval (Acc.) | 0-shot | 57.5 | 75.8 | 60.6 | **79.6** | -| Code | HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | **65.2** | -| | MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | **75.4** | -| | LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | **19.4** | -| | CRUXEval-I (Acc.) | 2-shot | 52.5 | 59.1 | 58.5 | **67.3** | -| | CRUXEval-O (Acc.) | 2-shot | 49.8 | 59.9 | 59.9 | **69.8** | -| Math | GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | **89.3** | -| | MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | **61.6** | -| | MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | **79.8** | -| | CMath (EM) | 3-shot | 78.7 | 84.5 | 77.3 | **90.7** | -| Chinese | CLUEWSC (EM) | 5-shot | 82.0 | 82.5 | **83.0** | 82.7 | -| | C-Eval (Acc.) | 5-shot | 81.4 | 89.2 | 72.5 | **90.1** | -| | CMMLU (Acc.) | 5-shot | 84.0 | **89.5** | 73.7 | 88.8 | -| | CMRC (EM) | 1-shot | **77.4** | 75.8 | 76.0 | 76.3 | -| | C3 (Acc.) | 0-shot | 77.4 | 76.7 | **79.7** | 78.6 | -| | CCPM (Acc.) | 0-shot | **93.0** | 88.5 | 78.6 | 92.0 | -| Multilingual | MMMLU-non-English (Acc.) | 5-shot | 64.0 | 74.8 | 73.8 | **79.4** | - -
- -> [!NOTE] -> Best results are shown in bold. Scores with a gap not exceeding 0.3 are considered to be at the same level. DeepSeek-V3 achieves the best performance on most benchmarks, especially on math and code tasks. -> For more evaluation details, please check our paper. - -#### Context Window -

- -

- -Evaluation results on the ``Needle In A Haystack`` (NIAH) tests. DeepSeek-V3 performs well across all context window lengths up to **128K**. - -### Chat Model -#### Standard Benchmarks (Models larger than 67B) -
- -| | **Benchmark (Metric)** | **DeepSeek V2-0506** | **DeepSeek V2.5-0905** | **Qwen2.5 72B-Inst.** | **Llama3.1 405B-Inst.** | **Claude-3.5-Sonnet-1022** | **GPT-4o 0513** | **DeepSeek V3** | -|---|---------------------|---------------------|----------------------|---------------------|----------------------|---------------------------|----------------|----------------| -| | Architecture | MoE | MoE | Dense | Dense | - | - | MoE | -| | # Activated Params | 21B | 21B | 72B | 405B | - | - | 37B | -| | # Total Params | 236B | 236B | 72B | 405B | - | - | 671B | -| English | MMLU (EM) | 78.2 | 80.6 | 85.3 | **88.6** | **88.3** | 87.2 | **88.5** | -| | MMLU-Redux (EM) | 77.9 | 80.3 | 85.6 | 86.2 | **88.9** | 88.0 | **89.1** | -| | MMLU-Pro (EM) | 58.5 | 66.2 | 71.6 | 73.3 | **78.0** | 72.6 | 75.9 | -| | DROP (3-shot F1) | 83.0 | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 | **91.6** | -| | IF-Eval (Prompt Strict) | 57.7 | 80.6 | 84.1 | 86.0 | **86.5** | 84.3 | 86.1 | -| | GPQA-Diamond (Pass@1) | 35.3 | 41.3 | 49.0 | 51.1 | **65.0** | 49.9 | 59.1 | -| | SimpleQA (Correct) | 9.0 | 10.2 | 9.1 | 17.1 | 28.4 | **38.2** | 24.9 | -| | FRAMES (Acc.) | 66.9 | 65.4 | 69.8 | 70.0 | 72.5 | **80.5** | 73.3 | -| | LongBench v2 (Acc.) | 31.6 | 35.4 | 39.4 | 36.1 | 41.0 | 48.1 | **48.7** | -| Code | HumanEval-Mul (Pass@1) | 69.3 | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 | **82.6** | -| | LiveCodeBench (Pass@1-COT) | 18.8 | 29.2 | 31.1 | 28.4 | 36.3 | 33.4 | **40.5** | -| | LiveCodeBench (Pass@1) | 20.3 | 28.4 | 28.7 | 30.1 | 32.8 | 34.2 | **37.6** | -| | Codeforces (Percentile) | 17.5 | 35.6 | 24.8 | 25.3 | 20.3 | 23.6 | **51.6** | -| | SWE Verified (Resolved) | - | 22.6 | 23.8 | 24.5 | **50.8** | 38.8 | 42.0 | -| | Aider-Edit (Acc.) | 60.3 | 71.6 | 65.4 | 63.9 | **84.2** | 72.9 | 79.7 | -| | Aider-Polyglot (Acc.) | - | 18.2 | 7.6 | 5.8 | 45.3 | 16.0 | **49.6** | -| Math | AIME 2024 (Pass@1) | 4.6 | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | **39.2** | -| | MATH-500 (EM) | 56.3 | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | **90.2** | -| | CNMO 2024 (Pass@1) | 2.8 | 10.8 | 15.9 | 6.8 | 13.1 | 10.8 | **43.2** | -| Chinese | CLUEWSC (EM) | 89.9 | 90.4 | **91.4** | 84.7 | 85.4 | 87.9 | 90.9 | -| | C-Eval (EM) | 78.6 | 79.5 | 86.1 | 61.5 | 76.7 | 76.0 | **86.5** | -| | C-SimpleQA (Correct) | 48.5 | 54.1 | 48.4 | 50.4 | 51.3 | 59.3 | **64.8** | - -
- -> [!NOTE] -> All models are evaluated in a configuration that limits the output length to 8K. Benchmarks containing fewer than 1000 samples are tested multiple times using varying temperature settings to derive robust final results. DeepSeek-V3 stands as the best-performing open-source model, and also exhibits competitive performance against frontier closed-source models. - - -#### Open Ended Generation Evaluation - -
- - - -| Model | Arena-Hard | AlpacaEval 2.0 | -|-------|------------|----------------| -| DeepSeek-V2.5-0905 | 76.2 | 50.5 | -| Qwen2.5-72B-Instruct | 81.2 | 49.1 | -| LLaMA-3.1 405B | 69.3 | 40.5 | -| GPT-4o-0513 | 80.4 | 51.1 | -| Claude-Sonnet-3.5-1022 | 85.2 | 52.0 | -| DeepSeek-V3 | **85.5** | **70.0** | - -
- -> [!NOTE] -> English open-ended conversation evaluations. For AlpacaEval 2.0, we use the length-controlled win rate as the metric. - - -## 5. Chat Website & API Platform -You can chat with DeepSeek-V3 on DeepSeek's official website: [chat.deepseek.com](https://chat.deepseek.com/sign_in) - -We also provide OpenAI-Compatible API at DeepSeek Platform: [platform.deepseek.com](https://platform.deepseek.com/) - -## 6. How to Run Locally - -DeepSeek-V3 can be deployed locally using the following hardware and open-source community software: - -1. **DeepSeek-Infer Demo**: We provide a simple and lightweight demo for FP8 and BF16 inference. -2. **SGLang**: Fully support the DeepSeek-V3 model in both BF16 and FP8 inference modes, with Multi-Token Prediction [coming soon](https://github.com/sgl-project/sglang/issues/2591). -3. **LMDeploy**: Enables efficient FP8 and BF16 inference for local and cloud deployment. -4. **TensorRT-LLM**: Currently supports BF16 inference and INT4/8 quantization, with FP8 support coming soon. -5. **vLLM**: Support DeepSeek-V3 model with FP8 and BF16 modes for tensor parallelism and pipeline parallelism. -6. **LightLLM**: Supports efficient single-node or multi-node deployment for FP8 and BF16. -7. **AMD GPU**: Enables running the DeepSeek-V3 model on AMD GPUs via SGLang in both BF16 and FP8 modes. -8. **Huawei Ascend NPU**: Supports running DeepSeek-V3 on Huawei Ascend devices. - -Since FP8 training is natively adopted in our framework, we only provide FP8 weights. If you require BF16 weights for experimentation, you can use the provided conversion script to perform the transformation. - -Here is an example of converting FP8 weights to BF16: - -```shell -cd inference -python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights +``` +DeepSeek V3 Zig +│ +├── Core +│ ├── Memory Management System +│ │ ├── Custom Allocator Framework +│ │ ├── Arena Allocation Strategy +│ │ └── Memory Pool Implementation +│ ├── Tensor Implementation +│ │ ├── SIMD-Optimized Operations +│ │ ├── Compile-Time Specialization +│ │ └── Zero-Cost Abstractions +│ └── Error Handling Framework +│ ├── Comprehensive Error Types +│ └── Performance-Optimized Error Paths +│ +├── Model Architecture +│ ├── Transformer Layers +│ │ ├── Comptime-Generated Layer Variants +│ │ └── Optimized Forward Pass +│ ├── Attention Mechanisms +│ │ ├── Vectorized Multi-Head Attention +│ │ └── Efficient KV-Cache Management +│ ├── MoE (Mixture of Experts) +│ │ ├── Parallel Expert Execution +│ │ └── Optimized Router Implementation +│ └── Embedding Systems +│ ├── Memory-Efficient Token Embeddings +│ └── Positional Encoding Optimizations +│ +├── Computation Backend +│ ├── CPU Implementation +│ │ ├── SIMD Vectorization +│ │ └── Multi-Threaded Execution +│ ├── GPU Integration (Optional) +│ │ ├── CUDA Support (NVIDIA) +│ │ ├── Metal Support (Apple) +│ │ └── ROCm Support (AMD) +│ └── Backend Interface Layer +│ ├── Zero-Cost Abstraction +│ └── Compile-Time Dispatch +│ +├── Inference Pipeline +│ ├── Model Loading & Weight Management +│ ├── Tokenization System +│ ├── Advanced Generation Strategies +│ │ ├── Speculative Decoding +│ │ └── Beam Search +│ └── Streaming Output Processing +│ +└── Optimization Layer + ├── Compile-Time Specialization + │ ├── Architecture-Specific Code Gen + │ └── Tensor Operation Optimization + ├── Runtime Performance Tuning + │ ├── Cache-Aware Memory Layout + │ └── Workload Balancing + └── Quantization Framework + ├── Mixed-Precision Support + └── 
Hardware-Accelerated Execution ``` -> [!NOTE] -> Hugging Face's Transformers has not been directly supported yet. +## Detailed Component Design -### 6.1 Inference with DeepSeek-Infer Demo (example only) +### 1. Core Systems -#### System Requirements +#### 1.1 Memory Management System -> [!NOTE] -> Linux with Python 3.10 only. Mac and Windows are not supported. +Memory management in Zig represents a significant advancement over Python's garbage collection. Zig provides explicit allocator interfaces that give fine-grained control over memory allocation and deallocation strategies: -Dependencies: -```pip-requirements -torch==2.4.1 -triton==3.0.0 -transformers==4.46.3 -safetensors==0.4.5 -``` -#### Model Weights & Demo Code Preparation +```zig +const std = @import("std"); -First, clone our DeepSeek-V3 GitHub repository: +// Define a custom tensor allocator that combines multiple strategies +pub const TensorAllocator = struct { + // Use arena for temporary tensor operations during inference + arena: std.heap.ArenaAllocator, + // Use a fixed buffer for small activations + fixed_buffer: [1024 * 1024]u8 = undefined, + fixed_allocator: std.heap.FixedBufferAllocator, + // General purpose allocator for long-lived objects + gpa: std.heap.GeneralPurposeAllocator(.{}), + + pub fn init(backing_allocator: std.mem.Allocator) !*TensorAllocator { + var self = try backing_allocator.create(TensorAllocator); + self.* = .{ + .arena = std.heap.ArenaAllocator.init(backing_allocator), + .fixed_allocator = std.heap.FixedBufferAllocator.init(&self.fixed_buffer), + .gpa = std.heap.GeneralPurposeAllocator(.{}){}, + }; + return self; + } + + pub fn deinit(self: *TensorAllocator) void { + self.arena.deinit(); + _ = self.gpa.deinit(); + // backing allocator will free self + } + + // Get the right allocator for specific tensor use cases + pub fn temporaryAllocator(self: *TensorAllocator) std.mem.Allocator { + return self.arena.allocator(); + } + + pub fn smallActivationAllocator(self: *TensorAllocator) std.mem.Allocator { + return self.fixed_allocator.allocator(); + } + + pub fn persistentAllocator(self: *TensorAllocator) std.mem.Allocator { + return self.gpa.allocator(); + } +}; -```shell -git clone https://github.com/deepseek-ai/DeepSeek-V3.git -``` - -Navigate to the `inference` folder and install dependencies listed in `requirements.txt`. Easiest way is to use a package manager like `conda` or `uv` to create a new virtual environment and install the dependencies. - -```shell -cd DeepSeek-V3/inference -pip install -r requirements.txt -``` - -Download the model weights from Hugging Face, and put them into `/path/to/DeepSeek-V3` folder. 
- -#### Model Weights Conversion - -Convert Hugging Face model weights to a specific format: - -```shell -python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16 -``` - -#### Run - -Then you can chat with DeepSeek-V3: - -```shell -torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200 -``` - -Or batch inference on a given file: - -```shell -torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --input-file $FILE -``` - -### 6.2 Inference with SGLang (recommended) - -[SGLang](https://github.com/sgl-project/sglang) currently supports [MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations), [DP Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models), FP8 (W8A8), FP8 KV Cache, and Torch Compile, delivering state-of-the-art latency and throughput performance among open-source frameworks. - -Notably, [SGLang v0.4.1](https://github.com/sgl-project/sglang/releases/tag/v0.4.1) fully supports running DeepSeek-V3 on both **NVIDIA and AMD GPUs**, making it a highly versatile and robust solution. - -SGLang also supports [multi-node tensor parallelism](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208), enabling you to run this model on multiple network-connected machines. - -Multi-Token Prediction (MTP) is in development, and progress can be tracked in the [optimization plan](https://github.com/sgl-project/sglang/issues/2591). - -Here are the launch instructions from the SGLang team: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3 - -### 6.3 Inference with LMDeploy (recommended) -[LMDeploy](https://github.com/InternLM/lmdeploy), a flexible and high-performance inference and serving framework tailored for large language models, now supports DeepSeek-V3. It offers both offline pipeline processing and online deployment capabilities, seamlessly integrating with PyTorch-based workflows. - -For comprehensive step-by-step instructions on running DeepSeek-V3 with LMDeploy, please refer to here: https://github.com/InternLM/lmdeploy/issues/2960 - - -### 6.4 Inference with TRT-LLM (recommended) - -[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) now supports the DeepSeek-V3 model, offering precision options such as BF16 and INT4/INT8 weight-only. Support for FP8 is currently in progress and will be released soon. You can access the custom branch of TRTLLM specifically for DeepSeek-V3 support through the following link to experience the new features directly: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/deepseek_v3. - - -### 6.5 Inference with vLLM (recommended) - -[vLLM](https://github.com/vllm-project/vllm) v0.6.6 supports DeepSeek-V3 inference for FP8 and BF16 modes on both NVIDIA and AMD GPUs. Aside from standard techniques, vLLM offers _pipeline parallelism_ allowing you to run this model on multiple machines connected by networks. For detailed guidance, please refer to the [vLLM instructions](https://docs.vllm.ai/en/latest/serving/distributed_serving.html). 
Please feel free to follow [the enhancement plan](https://github.com/vllm-project/vllm/issues/11539) as well. - -### 6.6 Inference with LightLLM (recommended) - -[LightLLM](https://github.com/ModelTC/lightllm/tree/main) v1.0.1 supports single-machine and multi-machine tensor parallel deployment for DeepSeek-R1 (FP8/BF16) and provides mixed-precision deployment, with more quantization modes continuously integrated. For more details, please refer to [LightLLM instructions](https://lightllm-en.readthedocs.io/en/latest/getting_started/quickstart.html). Additionally, LightLLM offers PD-disaggregation deployment for DeepSeek-V2, and the implementation of PD-disaggregation for DeepSeek-V3 is in development. - -### 6.7 Recommended Inference Functionality with AMD GPUs - -In collaboration with the AMD team, we have achieved Day-One support for AMD GPUs using SGLang, with full compatibility for both FP8 and BF16 precision. For detailed guidance, please refer to the [SGLang instructions](#63-inference-with-lmdeploy-recommended). - -### 6.8 Recommended Inference Functionality with Huawei Ascend NPUs -The [MindIE](https://www.hiascend.com/en/software/mindie) framework from the Huawei Ascend community has successfully adapted the BF16 version of DeepSeek-V3. For step-by-step guidance on Ascend NPUs, please follow the [instructions here](https://modelers.cn/models/MindIE/deepseekv3). - - -## 7. License -This code repository is licensed under [the MIT License](LICENSE-CODE). The use of DeepSeek-V3 Base/Chat models is subject to [the Model License](LICENSE-MODEL). DeepSeek-V3 series (including Base and Chat) supports commercial use. - -## 8. Citation -``` -@misc{deepseekai2024deepseekv3technicalreport, - title={DeepSeek-V3 Technical Report}, - author={DeepSeek-AI}, - year={2024}, - eprint={2412.19437}, - archivePrefix={arXiv}, - primaryClass={cs.CL}, - url={https://arxiv.org/abs/2412.19437}, +// Inference function example with specialized memory allocation +pub fn performInference(model: *Model, input: Tensor) !Tensor { + var allocator = try TensorAllocator.init(std.heap.page_allocator); + defer allocator.deinit(); + + // Use different allocators for different tensor operations + var activations = try computeActivations(model, input, allocator.temporaryAllocator()); + var weights = try loadModelWeights(model, allocator.persistentAllocator()); + + // Results are automatically freed when the arena is deinitialized + return try generateOutput(activations, weights, allocator.temporaryAllocator()); } ``` -## 9. Contact -If you have any questions, please raise an issue or contact us at [service@deepseek.com](service@deepseek.com). +**Key Features:** +- **Tiered Allocation Strategy**: Different allocators for different memory usage patterns +- **Arena Allocation**: Bulk allocation and freeing for intermediate tensors, dramatically reducing memory management overhead +- **Fixed Buffer Allocation**: Zero-heap-allocation path for small, predictable tensor operations +- **Memory Pool Implementation**: Custom pools for tensor data to minimize fragmentation +- **Explicit Error Handling**: All allocation failures are explicitly handled with Zig's error system + +#### 1.2 Tensor Implementation + +Tensors are the fundamental data structure for DeepSeek. 
Our implementation leverages Zig's advanced compile-time features, SIMD capabilities, and memory layout optimizations for maximum performance: + +```zig +pub fn Tensor(comptime DataType: type, comptime dimensions: usize) type { + return struct { + const Self = @This(); + + data: []DataType, + shape: [dimensions]usize, + strides: [dimensions]usize, + allocator: std.mem.Allocator, + is_contiguous: bool, + + // Vector types for SIMD operations based on hardware capabilities + pub const VecType = switch (DataType) { + f32 => if (@hasDecl(builtin, "cpu") and @hasDecl(builtin.cpu, "avx")) + @Vector(8, f32) // AVX + else if (@hasDecl(builtin, "cpu") and @hasDecl(builtin.cpu, "sse")) + @Vector(4, f32) // SSE + else + @Vector(4, f32), // Fallback + f16 => @Vector(8, f16), + i32 => @Vector(8, i32), + i8 => @Vector(16, i8), + else => @compileError("Unsupported data type for SIMD"), + }; + + // Number of elements in the SIMD vector + pub const vec_width = @sizeOf(VecType) / @sizeOf(DataType); + + pub fn init(allocator: std.mem.Allocator, shape: [dimensions]usize) !Self { + var strides: [dimensions]usize = undefined; + var total_size: usize = 1; + + // Calculate C-contiguous (row-major) strides for optimal memory access + var i: usize = dimensions; + while (i > 0) { + i -= 1; + strides[i] = total_size; + total_size *= shape[i]; + } + + // Align memory for optimal SIMD access + const alignment = @alignOf(VecType); + const data = try allocator.alignedAlloc(DataType, alignment, total_size); + + return Self{ + .data = data, + .shape = shape, + .strides = strides, + .allocator = allocator, + .is_contiguous = true, + }; + } + + pub fn deinit(self: *Self) void { + self.allocator.free(self.data); + } + + // Optimized SIMD matrix multiplication for 2D tensors + pub fn matmul(self: *Self, other: *Self, allocator: std.mem.Allocator) !Self { + std.debug.assert(dimensions == 2 and other.dimensions == 2); + std.debug.assert(self.shape[1] == other.shape[0]); + + const M = self.shape[0]; + const K = self.shape[1]; + const N = other.shape[1]; + + var result = try Self.init(allocator, .{ M, N }); + + // Zero initialization + @memset(result.data, 0); + + // Check if both tensors are contiguous for optimal performance + if (self.is_contiguous and other.is_contiguous) { + // Cache-aware blocked matrix multiplication with SIMD + const block_size = 64; // Tuned for L1 cache + + // For each block + var i: usize = 0; + while (i < M) : (i += block_size) { + const i_end = @min(i + block_size, M); + var j: usize = 0; + while (j < N) : (j += block_size) { + const j_end = @min(j + block_size, N); + var k: usize = 0; + while (k < K) : (k += block_size) { + const k_end = @min(k + block_size, K); + + // Process each block + var ii: usize = i; + while (ii < i_end) : (ii += 1) { + var jj: usize = j; + while (jj < j_end) : (jj += vec_width) { + // SIMD-optimized inner loop + if (jj + vec_width <= j_end) { + var sum: VecType = @splat(0); + var kk: usize = k; + while (kk < k_end) : (kk += 1) { + const a_val = self.data[ii * K + kk]; + const b_vec: VecType = blk: { + var tmp: [vec_width]DataType = undefined; + for (0..vec_width) |v| { + if (jj + v < j_end) { + tmp[v] = other.data[kk * N + (jj + v)]; + } else { + tmp[v] = 0; + } + } + break :blk tmp; + }; + sum += @splat(a_val) * b_vec; + } + + // Store result + for (0..vec_width) |v| { + if (jj + v < j_end) { + result.data[ii * N + (jj + v)] += sum[v]; + } + } + } else { + // Handle remaining columns (tail) + while (jj < j_end) : (jj += 1) { + var sum: DataType = 0; + var kk: usize = k; + 
while (kk < k_end) : (kk += 1) { + sum += self.data[ii * K + kk] * other.data[kk * N + jj]; + } + result.data[ii * N + jj] += sum; + } + } + } + } + } + } + } + } else { + // Fallback for non-contiguous tensors + var i: usize = 0; + while (i < M) : (i += 1) { + var j: usize = 0; + while (j < N) : (j += 1) { + var sum: DataType = 0; + var k: usize = 0; + while (k < K) : (k += 1) { + sum += self.at(.{i, k}) * other.at(.{k, j}); + } + try result.set(.{i, j}, sum); + } + } + } + + return result; + } + + // Access element at specific indices + pub fn at(self: Self, indices: [dimensions]usize) DataType { + var offset: usize = 0; + inline for (0..dimensions) |i| { + offset += indices[i] * self.strides[i]; + } + return self.data[offset]; + } + + // Set element at specific indices + pub fn set(self: *Self, indices: [dimensions]usize, value: DataType) !void { + var offset: usize = 0; + inline for (0..dimensions) |i| { + offset += indices[i] * self.strides[i]; + } + self.data[offset] = value; + } + + // Apply element-wise operations with SIMD acceleration + pub fn map(self: Self, comptime op: fn (DataType) DataType, allocator: std.mem.Allocator) !Self { + var result = try Self.init(allocator, self.shape); + + // Use SIMD operations for contiguous data + if (self.is_contiguous) { + var i: usize = 0; + const vec_chunks = self.data.len / vec_width; + + // Process in SIMD chunks + while (i < vec_chunks) : (i += 1) { + const base_idx = i * vec_width; + var vec: VecType = undefined; + + // Load vector + for (0..vec_width) |j| { + vec[j] = self.data[base_idx + j]; + } + + // Apply operation on each vector element + for (0..vec_width) |j| { + vec[j] = op(vec[j]); + } + + // Store result + for (0..vec_width) |j| { + result.data[base_idx + j] = vec[j]; + } + } + + // Process remaining elements + const remaining_start = vec_chunks * vec_width; + for (remaining_start..self.data.len) |j| { + result.data[j] = op(self.data[j]); + } + } else { + // Fallback for non-contiguous data + var indices: [dimensions]usize = .{0} ** dimensions; + var done = false; + + while (!done) { + const val = self.at(indices); + try result.set(indices, op(val)); + + // Increment indices + var d = dimensions - 1; + while (true) { + indices[d] += 1; + if (indices[d] < self.shape[d]) break; + indices[d] = 0; + if (d == 0) { + done = true; + break; + } + d -= 1; + } + } + } + + return result; + } + }; +} + +// Specialized tensor types for common uses +const FloatTensor1D = Tensor(f32, 1); +const FloatTensor2D = Tensor(f32, 2); +const FloatTensor4D = Tensor(f32, 4); // Common for batch x height x width x channels +const QuantizedTensor4D = Tensor(i8, 4); // For quantized operations +``` + +**Key Features:** +- **Hardware-Aware SIMD Vectorization**: Automatically selects optimal vector width based on CPU capabilities (AVX, SSE) +- **Cache-Optimized Algorithms**: Blocked matrix multiplication designed for L1/L2 cache efficiency +- **Aligned Memory Allocation**: Ensures data is properly aligned for SIMD operations +- **Specialized Tensor Types**: Pre-defined tensor configurations for common use cases +- **Automatic Fallbacks**: Graceful degradation for non-contiguous tensors or unsupported operations +- **Compile-Time Optimization**: Tensor dimensions and data types resolved at compile time for maximum performance +- **Zero-Runtime Overhead**: SIMD operations with no dynamic dispatch or virtual function calls + +#### 1.3 Error Handling Framework + +Zig's error handling system provides a powerful foundation for creating robust, high-performance 
software. Unlike exceptions in languages like C++ or Python, Zig's error handling is explicit and deterministic, making it particularly well-suited for large-scale machine learning applications: + +```zig +// Define a comprehensive set of potential errors with clear semantic meaning +const ModelError = error{ + ModelLoadFailed, + InvalidDimension, + InvalidShape, + OutOfMemory, + ComputeBackendError, + InvalidWeight, + UnsupportedOperation, + UnsupportedDataType, + DeviceNotAvailable, + TensorShapeMismatch, + QuantizationError, + InvalidConfiguration, +}; + +// Union error sets for comprehensive error handling +const DeepSeekError = ModelError || TensorError || AllocationError || IoError; + +// Example function demonstrating Zig's error handling with defer for cleanup +fn loadModel(allocator: std.mem.Allocator, path: []const u8) DeepSeekError!*Model { + var file = try std.fs.cwd().openFile(path, .{}); + defer file.close(); // Ensures file is closed even if an error occurs + + var buffer = std.ArrayList(u8).init(allocator); + defer buffer.deinit(); // Clean up buffer regardless of success/failure + + try buffer.ensureTotalCapacity(file.getEndPos() catch return ModelError.ModelLoadFailed); + + const bytes_read = try file.readAll(buffer.items); + if (bytes_read == 0) return ModelError.ModelLoadFailed; + + var model = try allocator.create(Model); + errdefer allocator.destroy(model); // Only called if an error occurs after this point + + model.* = Model.init(allocator); + errdefer model.deinit(); // Only called if an error occurs after this point + + // Parse weights and initialize model... + if (!try parseWeights(model, buffer.items)) { + return ModelError.InvalidWeight; + } + + return model; +} + +// Demonstrate error handling in caller code +pub fn main() !void { + var gpa = std.heap.GeneralPurposeAllocator(.{}){}; + defer _ = gpa.deinit(); + const allocator = gpa.allocator(); + + // Handle errors explicitly with try/catch blocks + const model = loadModel(allocator, "model.bin") catch |err| { + switch (err) { + ModelError.ModelLoadFailed => { + std.debug.print("Failed to load model file\n", .{}); + return err; + }, + ModelError.InvalidWeight => { + std.debug.print("Model contains invalid weights\n", .{}); + return err; + }, + else => { + std.debug.print("Unexpected error: {}\n", .{err}); + return err; + }, + } + }; + defer model.deinit(); + + // Continue with model usage... +} +``` + +**Key Features:** +- **Explicit Error Types**: Clearly defined error sets that precisely describe what can go wrong +- **No Exceptions**: Deterministic error handling with no hidden control flow +- **Resource Safety**: Automatic cleanup with `defer` and `errdefer` ensures resources are properly managed +- **Performance Optimization**: Error handling doesn't rely on stack unwinding or dynamic dispatch +- **Composable Error Sets**: Error types can be combined using the `||` operator +- **Try-Catch Blocks**: For selective error handling when needed +- **Error Tracing**: Built-in error return trace capability for debugging + +#### 1.4 Concurrency Model + +Zig's concurrency model will be leveraged to parallelize computation-intensive operations in DeepSeek. 
Zig's async/await syntax provides a structured approach to concurrency without the overhead of traditional threading: + +```zig +const std = @import("std"); + +// Thread pool for CPU-bound parallel tasks +pub const ComputeThreadPool = struct { + pool: std.Thread.Pool, + completion_count: std.atomic.Atomic(usize), + + pub fn init(thread_count: usize) !ComputeThreadPool { + var pool: std.Thread.Pool = undefined; + try pool.init(.{ + .allocator = std.heap.c_allocator, + .n_jobs = thread_count, + }); + + return ComputeThreadPool{ + .pool = pool, + .completion_count = std.atomic.Atomic(usize).init(0), + }; + } + + pub fn deinit(self: *ComputeThreadPool) void { + self.pool.deinit(); + } + + // Execute a compute task asynchronously + pub fn compute(self: *ComputeThreadPool, task: *const fn(*anyopaque) void, context: *anyopaque) !void { + try self.pool.spawn(task, context); + } + + // Wait for all compute tasks to complete + pub fn waitAll(self: *ComputeThreadPool) void { + // Process tasks in the event loop until all are complete + while (self.completion_count.load(.Acquire) > 0) { + std.time.sleep(1 * std.time.millisecond); + } + } +}; + +// Parallel tensor operation example with async/await +pub fn parallelMatMul(allocator: std.mem.Allocator, a: *Tensor(f32, 2), b: *Tensor(f32, 2)) !*Tensor(f32, 2) { + const M = a.shape[0]; + const K = a.shape[1]; + const N = b.shape[1]; + + var result = try Tensor(f32, 2).init(allocator, .{M, N}); + errdefer result.deinit(); + + @memset(result.data, 0); + + // Create thread pool with optimal number of threads + const cpu_count = try std.Thread.getCpuCount(); + var thread_pool = try ComputeThreadPool.init(cpu_count); + defer thread_pool.deinit(); + + // Split work based on number of available cores + const rows_per_thread = (M + cpu_count - 1) / cpu_count; + + // Define the worker task + const WorkContext = struct { + a: *const Tensor(f32, 2), + b: *const Tensor(f32, 2), + result: *Tensor(f32, 2), + start_row: usize, + end_row: usize, + thread_pool: *ComputeThreadPool, + }; + + // Worker function for computing a subset of rows + const workerFn = struct { + fn compute(context_ptr: *anyopaque) void { + const context = @ptrCast(*WorkContext, @alignCast(@alignOf(WorkContext), context_ptr)); + const a = context.a; + const b = context.b; + const result = context.result; + const start_row = context.start_row; + const end_row = context.end_row; + + // Compute assigned rows + for (start_row..end_row) |i| { + if (i >= a.shape[0]) break; + + for (0..b.shape[1]) |j| { + var sum: f32 = 0.0; + for (0..a.shape[1]) |k| { + sum += a.at(.{i, k}) * b.at(.{k, j}); + } + result.set(.{i, j}, sum) catch {}; + } + } + + // Mark task as complete + _ = context.thread_pool.completion_count.fetchSub(1, .Release); + } + }; + + // Spawn workers for each section of the matrix + for (0..cpu_count) |i| { + const start_row = i * rows_per_thread; + const end_row = std.math.min(start_row + rows_per_thread, M); + + if (start_row >= M) break; + + // Create context for this worker + var context = try allocator.create(WorkContext); + context.* = .{ + .a = a, + .b = b, + .result = result, + .start_row = start_row, + .end_row = end_row, + .thread_pool = &thread_pool, + }; + + // Increment completion counter before spawning task + _ = thread_pool.completion_count.fetchAdd(1, .Release); + + // Spawn the worker task + try thread_pool.compute(workerFn.compute, context); + } + + // Wait for all tasks to complete + thread_pool.waitAll(); + + return result; +} +``` + +**Key Features:** +- **Thread Pool 
Management**: Efficient worker thread allocation based on available CPU cores +- **Work Partitioning**: Automatic division of work across available cores +- **Minimal Synchronization**: Lock-free atomic counters for synchronization when needed +- **Resource Safety**: Proper cleanup with `defer` and `errdefer` even during concurrent execution +- **Structured Concurrency**: Clear task dependencies and lifecycle management +- **Zero Runtime Overhead**: No garbage collection or runtime dependencies + +### 2. Model Architecture + +#### 2.1 Transformer Core + +The transformer architecture is the foundation of DeepSeek V3. Our Zig implementation will leverage compile-time metaprogramming and advanced memory optimizations for maximum performance: + +```zig +const std = @import("std"); + +// Precomputed type variants for different data precisions +pub const DataType = enum { + f32, // 32-bit floating point (for debugging/development) + bf16, // BFloat16 (for training/default inference) + f16, // Float16 (for hardware with native f16 support) + i8, // 8-bit integer (for quantized inference) + i4, // 4-bit integer (for extreme quantization) +}; + +// Configuration struct with default values matching DeepSeek V3 +pub const ModelArgs = struct { + // Core model parameters + max_batch_size: usize = 8, + max_seq_len: usize = 4096 * 4, + data_type: DataType = .bf16, + vocab_size: usize = 102400, + dim: usize = 2048, + inter_dim: usize = 10944, + moe_inter_dim: usize = 1408, + n_layers: usize = 27, + n_dense_layers: usize = 1, + n_heads: usize = 16, + + // MoE configuration + n_routed_experts: usize = 64, + n_shared_experts: usize = 2, + n_activated_experts: usize = 6, + n_expert_groups: usize = 1, + n_limited_groups: usize = 1, + score_func: enum { softmax, sigmoid } = .softmax, + route_scale: f32 = 1.0, + + // MLA configuration + q_lora_rank: usize = 0, + kv_lora_rank: usize = 512, + qk_nope_head_dim: usize = 128, + qk_rope_head_dim: usize = 64, + v_head_dim: usize = 128, + + // Positional encoding + original_seq_len: usize = 4096, + rope_theta: f32 = 10000.0, + rope_factor: f32 = 40, + beta_fast: usize = 32, + beta_slow: usize = 1, + mscale: f32 = 1.0, + + // Runtime options + use_flash_attention: bool = true, // Use optimized attention implementation + use_parallel_experts: bool = true, // Run experts in parallel + max_token_limit: ?usize = null, // Optional token generation limit + + // Generate optimized implementations based on config parameters + pub fn getModelType(self: @This()) type { + return struct { + const ModelType = @This(); + const config = self; + + // Select optimal types based on data_type + pub const StorageType = switch (config.data_type) { + .f32 => f32, + .bf16 => std.packed_bf16, + .f16 => f16, + .i8 => i8, + .i4 => i4, + }; + + // Define tensor types for different dimensions + pub const WeightTensor = Tensor(StorageType, 2); + pub const ActivationTensor = Tensor(f32, 3); // Always use f32 for activations + pub const EmbeddingTensor = Tensor(StorageType, 2); + pub const KVCacheTensor = Tensor(f32, 4); // [batch, seq_len, heads, dim] + + // Generate layer configuration + pub const layer_config = struct { + pub const head_dim = (config.dim / config.n_heads); + pub const moe_layers_start = config.n_dense_layers; + }; + }; + } +}; + +// Main transformer model implementation +pub fn TransformerModel(comptime args: ModelArgs) type { + // Use comptime to generate a specialized model implementation based on args + return struct { + const Self = @This(); + const ModelType = 
args.getModelType(); + + // Model components + allocator: std.mem.Allocator, + embedding: Embedding(args), + layers: []TransformerBlock(args), + norm: RMSNorm(args.dim), + head: Linear(args.dim, args.vocab_size), + freqs_cis: Tensor(f32, 3), // [max_seq_len, 2, qk_rope_head_dim] + + // KV cache for optimized inference + kv_cache: ?ModelType.KVCacheTensor, + + pub fn init(allocator: std.mem.Allocator) !Self { + // Initialize components + var embedding = try Embedding(args).init(allocator); + errdefer embedding.deinit(); + + var layers = try allocator.alloc(TransformerBlock(args), args.n_layers); + errdefer allocator.free(layers); + + // Create layers with appropriate configurations + for (layers, 0..) |*layer, i| { + const is_moe = i >= args.n_dense_layers; + layer.* = try TransformerBlock(args).init(allocator, i, is_moe); + } + + var norm = try RMSNorm(args.dim).init(allocator); + errdefer norm.deinit(); + + var head = try Linear(args.dim, args.vocab_size).init(allocator, false); + errdefer head.deinit(); + + // Precompute positional encoding frequencies + var freqs_cis = try precomputeFreqsCis(allocator, args); + + return Self{ + .allocator = allocator, + .embedding = embedding, + .layers = layers, + .norm = norm, + .head = head, + .freqs_cis = freqs_cis, + .kv_cache = null, + }; + } + + pub fn deinit(self: *Self) void { + self.embedding.deinit(); + + for (self.layers) |*layer| { + layer.deinit(); + } + self.allocator.free(self.layers); + + self.norm.deinit(); + self.head.deinit(); + self.freqs_cis.deinit(); + + if (self.kv_cache) |*cache| { + cache.deinit(); + } + } + + // Initialize KV cache for efficient inference + pub fn initKVCache(self: *Self) !void { + if (self.kv_cache != null) return; + + const batch_size = args.max_batch_size; + const seq_len = args.max_seq_len; + const n_heads = args.n_heads; + const head_dim = ModelType.layer_config.head_dim; + + self.kv_cache = try ModelType.KVCacheTensor.init( + self.allocator, + .{batch_size, seq_len, n_heads, head_dim * 2} + ); + + // Zero-initialize cache + @memset(self.kv_cache.?.data, 0); + } + + // Forward pass through the transformer model + pub fn forward(self: *Self, token_ids: []const usize, start_pos: usize) !Tensor(f32, 2) { + const batch_size = 1; // Currently supporting batch_size=1 for inference + const seq_len = token_ids.len; + + // Create tensor from token_ids + var input_tensor = try ModelType.ActivationTensor.init( + self.allocator, + .{batch_size, seq_len, args.dim} + ); + defer input_tensor.deinit(); + + // Get embeddings for input tokens + try self.embedding.embed(token_ids, &input_tensor); + + // Process through each transformer layer + var x = input_tensor; + const freqs_cis_slice = try self.freqs_cis.slice(.{start_pos, 0, 0}, .{start_pos + seq_len, 2, args.qk_rope_head_dim}); + + // Create attention mask for causal attention + var mask: ?Tensor(f32, 2) = null; + if (seq_len > 1) { + mask = try createCausalMask(self.allocator, seq_len); + defer if (mask) |*m| m.deinit(); + } + + // Process through transformer layers + for (self.layers) |*layer| { + x = try layer.forward(x, start_pos, freqs_cis_slice, mask); + } + + // Apply final normalization + var normalized = try self.norm.forward(x); + defer normalized.deinit(); + + // Extract last token for prediction + var last_token = try normalized.slice( + .{0, seq_len - 1, 0}, + .{batch_size, seq_len, args.dim} + ); + defer last_token.deinit(); + + // Project to vocabulary + return try self.head.forward(last_token); + } + + // Helper to create causal attention mask + fn 
createCausalMask(allocator: std.mem.Allocator, seq_len: usize) !Tensor(f32, 2) { + var mask = try Tensor(f32, 2).init(allocator, .{seq_len, seq_len}); + errdefer mask.deinit(); + + for (0..seq_len) |i| { + for (0..seq_len) |j| { + const value: f32 = if (j <= i) 0.0 else -10000.0; + try mask.set(.{i, j}, value); + } + } + + return mask; + } + }; +} + +// Generate specialized transformer based on configuration +pub fn createTransformer(allocator: std.mem.Allocator, args: ModelArgs) !*TransformerModel(args) { + var model = try allocator.create(TransformerModel(args)); + errdefer allocator.destroy(model); + + model.* = try TransformerModel(args).init(allocator); + return model; +} +``` + +This implementation leverages Zig's compile-time features to generate specialized model implementations based on the provided configuration parameters. The use of generic types and comptime evaluation allows for maximum performance optimization while maintaining code flexibility. + +#### 2.2 Attention Mechanism + +The Multi-Head Latent Attention (MLA) mechanism is a critical component of DeepSeek V3's performance. Our Zig implementation leverages compile-time specialization, SIMD vectorization, and cache-friendly algorithms for maximum efficiency: + +```zig +// Generic MLA implementation with compile-time specialization +pub fn MLA(comptime args: ModelArgs) type { + return struct { + const Self = @This(); + const ModelType = args.getModelType(); + + // Attention configuration + dim: usize, + n_heads: usize, + head_dim: usize, + q_lora_rank: usize, + kv_lora_rank: usize, + qk_nope_head_dim: usize, + qk_rope_head_dim: usize, + qk_head_dim: usize, + v_head_dim: usize, + softmax_scale: f32, + use_flash_attention: bool, + + // Projection matrices + allocator: std.mem.Allocator, + wq: ?ColumnParallelLinear(args) = null, // Regular query projection + wq_a: ?Linear(args.dim, args.q_lora_rank) = null, // LoRA decomposition + q_norm: ?RMSNorm(args.q_lora_rank) = null, // LoRA normalization + wq_b: ?ColumnParallelLinear(args) = null, // LoRA decomposition + wkv_a: Linear(args.dim, args.kv_lora_rank + args.qk_rope_head_dim), + kv_norm: RMSNorm(args.kv_lora_rank), + wkv_b: ColumnParallelLinear(args), + wo: RowParallelLinear(args), + + // KV caching - optimized for memory access patterns + kv_cache: ?Tensor(f32, 4) = null, // [batch, seq_len, heads, head_dim*2] + rope_cache: ?Tensor(f32, 3) = null, // [batch, seq_len, rope_dim] + + // Initialize MLA with appropriate configuration + pub fn init(allocator: std.mem.Allocator) !Self { + const head_dim = args.dim / args.n_heads; + var softmax_scale = 1.0 / std.math.sqrt(@as(f32, @floatFromInt(args.qk_nope_head_dim + args.qk_rope_head_dim))); + + // Apply scaling for extended context if needed + if (args.max_seq_len > args.original_seq_len) { + const mscale = 0.1 * args.mscale * std.math.log(args.rope_factor) + 1.0; + softmax_scale *= mscale * mscale; + } + + // Initialize query projection (either direct or with LoRA) + var wq: ?ColumnParallelLinear(args) = null; + var wq_a: ?Linear(args.dim, args.q_lora_rank) = null; + var q_norm: ?RMSNorm(args.q_lora_rank) = null; + var wq_b: ?ColumnParallelLinear(args) = null; + + if (args.q_lora_rank == 0) { + // Standard query projection + wq = try ColumnParallelLinear(args).init( + allocator, + args.dim, + args.n_heads * (args.qk_nope_head_dim + args.qk_rope_head_dim), + false + ); + } else { + // Low-rank adaptation for query + wq_a = try Linear(args.dim, args.q_lora_rank).init(allocator, false); + q_norm = try 
RMSNorm(args.q_lora_rank).init(allocator); + wq_b = try ColumnParallelLinear(args).init( + allocator, + args.q_lora_rank, + args.n_heads * (args.qk_nope_head_dim + args.qk_rope_head_dim), + false + ); + } + + // Key-value projections + var wkv_a = try Linear(args.dim, args.kv_lora_rank + args.qk_rope_head_dim).init(allocator, false); + var kv_norm = try RMSNorm(args.kv_lora_rank).init(allocator); + var wkv_b = try ColumnParallelLinear(args).init( + allocator, + args.kv_lora_rank, + args.n_heads * (args.qk_nope_head_dim + args.v_head_dim), + false + ); + + // Output projection + var wo = try RowParallelLinear(args).init( + allocator, + args.n_heads * args.v_head_dim, + args.dim, + false + ); + + return Self{ + .allocator = allocator, + .dim = args.dim, + .n_heads = args.n_heads, + .head_dim = head_dim, + .q_lora_rank = args.q_lora_rank, + .kv_lora_rank = args.kv_lora_rank, + .qk_nope_head_dim = args.qk_nope_head_dim, + .qk_rope_head_dim = args.qk_rope_head_dim, + .qk_head_dim = args.qk_nope_head_dim + args.qk_rope_head_dim, + .v_head_dim = args.v_head_dim, + .softmax_scale = softmax_scale, + .use_flash_attention = args.use_flash_attention, + .wq = wq, + .wq_a = wq_a, + .q_norm = q_norm, + .wq_b = wq_b, + .wkv_a = wkv_a, + .kv_norm = kv_norm, + .wkv_b = wkv_b, + .wo = wo, + }; + } + + pub fn deinit(self: *Self) void { + if (self.wq) |*w| w.deinit(); + if (self.wq_a) |*w| w.deinit(); + if (self.q_norm) |*n| n.deinit(); + if (self.wq_b) |*w| w.deinit(); + + self.wkv_a.deinit(); + self.kv_norm.deinit(); + self.wkv_b.deinit(); + self.wo.deinit(); + + if (self.kv_cache) |*cache| cache.deinit(); + if (self.rope_cache) |*cache| cache.deinit(); + } + + // Initialize KV cache for efficient inference + pub fn initKVCache(self: *Self, batch_size: usize, seq_len: usize) !void { + if (self.kv_cache != null) return; + + // Allocate KV cache + self.kv_cache = try Tensor(f32, 4).init( + self.allocator, + .{batch_size, seq_len, self.n_heads, self.head_dim * 2} + ); + + // Zero-initialize + @memset(self.kv_cache.?.data, 0); + + // Allocate rotary positional encoding cache + self.rope_cache = try Tensor(f32, 3).init( + self.allocator, + .{batch_size, seq_len, self.qk_rope_head_dim} + ); + + @memset(self.rope_cache.?.data, 0); + } + + // Forward pass implementation with multiple specialized paths + pub fn forward( + self: *Self, + x: Tensor(f32, 3), + start_pos: usize, + freqs_cis: Tensor(f32, 3), + mask: ?Tensor(f32, 2) + ) !Tensor(f32, 3) { + const batch_size = x.shape[0]; + const seq_len = x.shape[1]; + const end_pos = start_pos + seq_len; + + // Initialize KV cache if not already done + if (start_pos > 0 and self.kv_cache == null) { + try self.initKVCache(batch_size, args.max_seq_len); + } + + // Compute query vectors + var q: Tensor(f32, 4) = undefined; + if (self.q_lora_rank == 0) { + // Standard query projection + var q_flat = try self.wq.?.forward(x); + defer q_flat.deinit(); + + // Reshape to [batch, seq_len, heads, head_dim] + q = try q_flat.reshape(.{batch_size, seq_len, self.n_heads, self.qk_head_dim}); + } else { + // Low-rank adaptation + var q_a = try self.wq_a.?.forward(x); + defer q_a.deinit(); + + var q_norm = try self.q_norm.?.forward(q_a); + defer q_norm.deinit(); + + var q_b = try self.wq_b.?.forward(q_norm); + defer q_b.deinit(); + + // Reshape + q = try q_b.reshape(.{batch_size, seq_len, self.n_heads, self.qk_head_dim}); + } + defer q.deinit(); + + // Split query into regular and positional parts + var q_slices = try q.split(3, .{self.qk_nope_head_dim, self.qk_rope_head_dim}); + defer for 
(q_slices) |*slice| slice.deinit(); + + var q_nope = q_slices[0]; + var q_pe = q_slices[1]; + + // Apply rotary embeddings to position-dependent part + try applyRotaryEmbeddings(&q_pe, freqs_cis); + + // Compute key-value vectors + var kv_raw = try self.wkv_a.forward(x); + defer kv_raw.deinit(); + + // Split into KV features and positional features + var kv_slices = try kv_raw.split(2, .{self.kv_lora_rank, self.qk_rope_head_dim}); + defer for (kv_slices) |*slice| slice.deinit(); + + var kv_features = kv_slices[0]; + var k_pe_features = kv_slices[1]; + + // Add batch and heads dimension to positional features + var k_pe = try k_pe_features.reshape(.{batch_size, seq_len, 1, self.qk_rope_head_dim}); + defer k_pe.deinit(); + + // Apply rotary embeddings + try applyRotaryEmbeddings(&k_pe, freqs_cis); + + // Process main KV branch + var kv_norm_features = try self.kv_norm.forward(kv_features); + defer kv_norm_features.deinit(); + + var kv_proj = try self.wkv_b.forward(kv_norm_features); + defer kv_proj.deinit(); + + // Reshape to separate K and V + var kv_reshaped = try kv_proj.reshape( + .{batch_size, seq_len, self.n_heads, self.qk_nope_head_dim + self.v_head_dim} + ); + defer kv_reshaped.deinit(); + + // Split into K and V + var kv_parts = try kv_reshaped.split(3, .{self.qk_nope_head_dim, self.v_head_dim}); + defer for (kv_parts) |*part| part.deinit(); + + var k_nope = kv_parts[0]; + var v = kv_parts[1]; + + // Combine positional and non-positional key parts + var k = try combineTensors(k_nope, k_pe, 3); + defer k.deinit(); + + // Store in KV cache if available + if (self.kv_cache != null) { + try self.updateKVCache(k, v, start_pos, end_pos); + } + + // Choose attention implementation based on settings + var attention_output: Tensor(f32, 4) = undefined; + if (self.use_flash_attention and seq_len > 1) { + attention_output = try self.computeFlashAttention( + q_nope, + q_pe, + self.kv_cache.?, + self.rope_cache.?, + mask, + batch_size, + seq_len, + end_pos + ); + } else { + attention_output = try self.computeStandardAttention( + q, + k, + v, + mask, + batch_size, + seq_len, + end_pos + ); + } + defer attention_output.deinit(); + + // Final projection + var attention_flat = try attention_output.reshape( + .{batch_size, seq_len, self.n_heads * self.v_head_dim} + ); + defer attention_flat.deinit(); + + return self.wo.forward(attention_flat); + } + + // Flash attention implementation optimized for large contexts + fn computeFlashAttention( + self: *const Self, + q_nope: Tensor(f32, 4), + q_pe: Tensor(f32, 4), + kv_cache: Tensor(f32, 4), + rope_cache: Tensor(f32, 3), + mask: ?Tensor(f32, 2), + batch_size: usize, + seq_len: usize, + end_pos: usize + ) !Tensor(f32, 4) { + // Flash attention implementation with tiling to maximize cache efficiency + // This function would include a highly optimized SIMD implementation + // specializing in memory-efficient attention computation + + // Note: This would be a substantial implementation with memory-efficient + // blocked matrix multiplication and careful SIMD optimization + // We're providing a simplified structure here + + // For a full implementation, see the FlashAttention algorithm paper + const block_size = 32; // Block size tuned for L1 cache + + // Output tensor + var output = try Tensor(f32, 4).init( + self.allocator, + .{batch_size, seq_len, self.n_heads, self.v_head_dim} + ); + + // Implement blocked attention algorithm... 
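            // To make the intended algorithm concrete, here is the loop structure in
            // comment form (a sketch only; the block size, cache layout, and the
            // at()/set() tensor accessors follow the conventions assumed elsewhere
            // in this document):
            //
            //   for each block Qb of `block_size` query rows:
            //       m_i = -inf, l_i = 0, acc_i = 0 for every row i in Qb
            //       for each block Kb/Vb of `block_size` cached key/value positions:
            //           S = (Qb x Kb^T) * softmax_scale       // scores for this tile only
            //           apply the causal mask entries that fall inside this tile
            //           m_new = max(m_i, rowmax(S))
            //           acc_i = acc_i * exp(m_i - m_new) + exp(S - m_new) x Vb
            //           l_i   = l_i   * exp(m_i - m_new) + rowsum(exp(S - m_new))
            //           m_i   = m_new
            //       output rows of Qb = acc_i / l_i
            //
            // Because the score tile S only ever exists one block at a time, the full
            // seq_len x seq_len attention matrix is never materialized, which is the
            // memory saving that makes long-context attention practical.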
+ // This would contain optimized SIMD code for tiled attention computation + + return output; + } + + // Standard attention for shorter sequences or when flash attention is disabled + fn computeStandardAttention( + self: *const Self, + q: Tensor(f32, 4), + k: Tensor(f32, 4), + v: Tensor(f32, 4), + mask: ?Tensor(f32, 2), + batch_size: usize, + seq_len: usize, + end_pos: usize + ) !Tensor(f32, 4) { + // Compute QK attention scores + var scores = try computeAttentionScores(q, k, self.softmax_scale); + defer scores.deinit(); + + // Apply causal mask if provided + if (mask) |m| { + try applyAttentionMask(&scores, m); + } + + // Apply softmax + try applySoftmax(&scores, -1); + + // Compute attention output (scores @ v) + return computeAttentionOutput(scores, v); + } + + // Update KV cache with new values + fn updateKVCache( + self: *Self, + k: Tensor(f32, 4), + v: Tensor(f32, 4), + start_pos: usize, + end_pos: usize + ) !void { + const batch_size = k.shape[0]; + const seq_len = k.shape[1]; + + // Update key cache + for (0..batch_size) |b| { + for (0..seq_len) |s| { + const cache_pos = start_pos + s; + for (0..self.n_heads) |h| { + // Copy K values + for (0..self.qk_head_dim) |d| { + const k_val = try k.at(.{b, s, h, d}); + try self.kv_cache.?.set(.{b, cache_pos, h, d}, k_val); + } + + // Copy V values + for (0..self.v_head_dim) |d| { + const v_val = try v.at(.{b, s, h, d}); + try self.kv_cache.?.set(.{b, cache_pos, h, self.qk_head_dim + d}, v_val); + } + } + } + } + } + }; +} +``` + +**Key Optimizations:** +- **Compile-Time Specialization**: Generated attention routines are tailored to model dimensions at compile time +- **Flash Attention Algorithm**: Memory-efficient attention computation for long sequences +- **SIMD-Optimized Matrix Operations**: Vectorized attention score calculation and softmax +- **Optimized KV-Cache Layout**: Cache-friendly memory layout for efficient sequence generation +- **Sparse Attention Patterns**: Support for different attention patterns beyond standard causal attention +- **Memory Reuse**: Careful tensor management to minimize allocations during inference +- **Specialized Attention Paths**: Different implementations optimized for inference vs. training +- **Low-Rank Adaptation**: LoRA support for more efficient fine-tuning + +#### 2.3 Mixture of Experts (MoE) + +The Mixture of Experts (MoE) architecture is a key innovation in DeepSeek V3 that enables scaling model capacity without proportionally increasing computation cost. 
Our Zig implementation leverages compile-time specialization and parallel execution for maximum efficiency: + +```zig +// Generic MoE implementation with compile-time specialization +pub fn MixtureOfExperts(comptime args: ModelArgs) type { + return struct { + const Self = @This(); + const ModelType = args.getModelType(); + + // Configuration + allocator: std.mem.Allocator, + dim: usize, + n_routed_experts: usize, + n_local_experts: usize, + n_activated_experts: usize, + experts_start_idx: usize, + experts_end_idx: usize, + use_parallel_execution: bool, + + // Components + gate: RouterGate(args), + experts: []Expert(args), + shared_experts: MLP(args), + thread_pool: ?*ComputeThreadPool = null, + + // Initialize MoE with appropriate configuration + pub fn init(allocator: std.mem.Allocator) !Self { + // Determine expert distribution across processes + const world_size = 1; // Set to actual world size for distributed training + const rank = 0; // Set to actual rank for distributed training + + std.debug.assert(args.n_routed_experts % world_size == 0, + "Number of experts must be divisible by world size"); + + const n_local_experts = args.n_routed_experts / world_size; + const experts_start_idx = rank * n_local_experts; + const experts_end_idx = experts_start_idx + n_local_experts; + + // Initialize routing gate + var gate = try RouterGate(args).init(allocator); + errdefer gate.deinit(); + + // Initialize experts + var experts = try allocator.alloc(Expert(args), args.n_routed_experts); + errdefer allocator.free(experts); + + // Only initialize experts that belong to this process + for (experts, 0..) |*expert, i| { + if (experts_start_idx <= i and i < experts_end_idx) { + expert.* = try Expert(args).init(allocator); + } else { + expert.* = undefined; // Not used on this process + } + } + + // Initialize shared experts (always executed) + var shared_experts = try MLP(args).init( + allocator, + args.dim, + args.n_shared_experts * args.moe_inter_dim + ); + errdefer shared_experts.deinit(); + + // Initialize thread pool for parallel execution if needed + var thread_pool: ?*ComputeThreadPool = null; + if (args.use_parallel_experts) { + thread_pool = try allocator.create(ComputeThreadPool); + const cpu_count = try std.Thread.getCpuCount(); + const optimal_threads = std.math.min( + cpu_count, + args.n_activated_experts + args.n_shared_experts + ); + thread_pool.?.* = try ComputeThreadPool.init(optimal_threads); + } + + return Self{ + .allocator = allocator, + .dim = args.dim, + .n_routed_experts = args.n_routed_experts, + .n_local_experts = n_local_experts, + .n_activated_experts = args.n_activated_experts, + .experts_start_idx = experts_start_idx, + .experts_end_idx = experts_end_idx, + .use_parallel_execution = args.use_parallel_experts, + .gate = gate, + .experts = experts, + .shared_experts = shared_experts, + .thread_pool = thread_pool, + }; + } + + pub fn deinit(self: *Self) void { + self.gate.deinit(); + + // Only deinit experts that belong to this process + for (self.experts, 0..) 
|*expert, i| { + if (self.experts_start_idx <= i and i < self.experts_end_idx) { + expert.deinit(); + } + } + self.allocator.free(self.experts); + + self.shared_experts.deinit(); + + if (self.thread_pool) |pool| { + pool.deinit(); + self.allocator.destroy(pool); + } + } + + // Forward pass implementation with parallel expert execution + pub fn forward(self: *Self, x: Tensor(f32, 3)) !Tensor(f32, 3) { + const batch_size = x.shape[0]; + const seq_len = x.shape[1]; + + // Reshape input for routing + var x_flat = try x.reshape(.{batch_size * seq_len, self.dim}); + defer x_flat.deinit(); + + // Router computation + var router_output = try self.gate.forward(x_flat); + defer { + router_output.weights.deinit(); + router_output.indices.deinit(); + } + + // Get routing weights and indices + const weights = router_output.weights; + const indices = router_output.indices; + + // Initialize result tensor with zeros + var result = try Tensor(f32, 2).init( + self.allocator, + .{batch_size * seq_len, self.dim} + ); + errdefer result.deinit(); + + @memset(result.data, 0); + + // Count expert assignments for load balancing analysis + var expert_counts = try self.allocator.alloc(usize, self.n_routed_experts); + defer self.allocator.free(expert_counts); + @memset(expert_counts, 0); + + for (indices.data) |idx| { + expert_counts[idx] += 1; + } + + // Process each expert + if (self.use_parallel_execution and self.thread_pool != null) { + try self.parallelExpertExecution( + x_flat, + weights, + indices, + expert_counts, + &result + ); + } else { + try self.sequentialExpertExecution( + x_flat, + weights, + indices, + expert_counts, + &result + ); + } + + // Always execute shared experts + var shared_output = try self.shared_experts.forward(x_flat); + defer shared_output.deinit(); + + // Add shared expert output to result + try addTensors(&result, shared_output); + + // Reshape back to original dimensions + return result.reshape(.{batch_size, seq_len, self.dim}); + } + + // Parallel execution of experts using thread pool + fn parallelExpertExecution( + self: *Self, + x: Tensor(f32, 2), + weights: Tensor(f32, 2), + indices: Tensor(usize, 2), + expert_counts: []usize, + result: *Tensor(f32, 2) + ) !void { + const thread_pool = self.thread_pool.?; + var work_queue = std.ArrayList(ExpertWorkItem).init(self.allocator); + defer work_queue.deinit(); + + // Create work items for each expert + for (0..self.n_routed_experts) |expert_idx| { + if (expert_counts[expert_idx] == 0) continue; + + if (expert_idx < self.experts_start_idx or expert_idx >= self.experts_end_idx) { + // Skip experts not assigned to this process + continue; + } + + // Extract tokens routed to this expert + var token_indices = try self.allocator.alloc(usize, expert_counts[expert_idx]); + var token_weights = try self.allocator.alloc(f32, expert_counts[expert_idx]); + + var token_count: usize = 0; + for (0..x.shape[0]) |i| { + for (0..self.n_activated_experts) |j| { + const index_offset = i * self.n_activated_experts + j; + if (indices.data[index_offset] == expert_idx) { + token_indices[token_count] = i; + token_weights[token_count] = weights.data[index_offset]; + token_count += 1; + } + } + } + + // Create work item + try work_queue.append(.{ + .allocator = self.allocator, + .expert = &self.experts[expert_idx], + .x = x, + .token_indices = token_indices, + .token_weights = token_weights, + .result = result, + .thread_pool = thread_pool, + }); + } + + // Schedule parallel expert execution + for (work_queue.items) |*work_item| { + // Increment completion 
counter + _ = thread_pool.completion_count.fetchAdd(1, .Release); + + // Submit task to thread pool + try thread_pool.compute(processExpertWork, work_item); + } + + // Wait for all expert computations to complete + thread_pool.waitAll(); + } + + // Sequential execution of experts + fn sequentialExpertExecution( + self: *Self, + x: Tensor(f32, 2), + weights: Tensor(f32, 2), + indices: Tensor(usize, 2), + expert_counts: []usize, + result: *Tensor(f32, 2) + ) !void { + // Process each expert sequentially + for (0..self.n_routed_experts) |expert_idx| { + if (expert_counts[expert_idx] == 0) continue; + + if (expert_idx < self.experts_start_idx or expert_idx >= self.experts_end_idx) { + // Skip experts not assigned to this process + continue; + } + + // Get tokens assigned to this expert + for (0..x.shape[0]) |i| { + for (0..self.n_activated_experts) |j| { + const index_offset = i * self.n_activated_experts + j; + if (indices.data[index_offset] == expert_idx) { + // Process token with this expert + const token_weight = weights.data[index_offset]; + + // Extract input token + var token_input = try x.slice(.{i, 0}, .{i + 1, self.dim}); + defer token_input.deinit(); + + // Process through expert + var expert_output = try self.experts[expert_idx].forward(token_input); + defer expert_output.deinit(); + + // Scale by routing weight + try scaleTensor(&expert_output, token_weight); + + // Add to result + for (0..self.dim) |d| { + result.data[i * self.dim + d] += expert_output.data[d]; + } + } + } + } + } + } + + // Worker task for parallel expert execution + const ExpertWorkItem = struct { + allocator: std.mem.Allocator, + expert: *Expert(args), + x: Tensor(f32, 2), + token_indices: []usize, + token_weights: []f32, + result: *Tensor(f32, 2), + thread_pool: *ComputeThreadPool, + }; + + fn processExpertWork(ctx_ptr: *anyopaque) void { + const ctx = @ptrCast(*ExpertWorkItem, @alignCast(@alignOf(ExpertWorkItem), ctx_ptr)); + defer { + ctx.allocator.free(ctx.token_indices); + ctx.allocator.free(ctx.token_weights); + _ = ctx.thread_pool.completion_count.fetchSub(1, .Release); + } + + // Process each token assigned to this expert + for (ctx.token_indices, ctx.token_weights, 0..) 
|token_idx, weight, i| { + // Extract input token + var token_input = ctx.x.slice(.{token_idx, 0}, .{token_idx + 1, ctx.x.shape[1]}) catch return; + defer token_input.deinit(); + + // Process through expert + var expert_output = ctx.expert.forward(token_input) catch return; + defer expert_output.deinit(); + + // Scale by routing weight + scaleTensor(&expert_output, weight) catch return; + + // Add to result (using atomic operations to avoid race conditions) + for (0..expert_output.shape[1]) |d| { + const offset = token_idx * expert_output.shape[1] + d; + const old_val = @atomicLoad(f32, &ctx.result.data[offset], .Acquire); + const new_val = old_val + expert_output.data[d]; + @atomicStore(f32, &ctx.result.data[offset], new_val, .Release); + } + } + } + }; +} + +// Router gate for MoE that determines which experts to use for each token +pub fn RouterGate(comptime args: ModelArgs) type { + return struct { + const Self = @This(); + + allocator: std.mem.Allocator, + dim: usize, + n_experts: usize, + n_groups: usize, + n_limited_groups: usize, + topk: usize, + score_func: enum { softmax, sigmoid }, + route_scale: f32, + + // Router weights + weight: Tensor(f32, 2), + bias: ?Tensor(f32, 1) = null, + + pub fn init(allocator: std.mem.Allocator) !Self { + var weight = try Tensor(f32, 2).init( + allocator, + .{args.n_routed_experts, args.dim} + ); + + // Initialize with appropriate distribution + try initializeParameters(&weight, 0.0, 0.02); + + // Create optional bias + var bias: ?Tensor(f32, 1) = null; + if (args.dim == 7168) { // Special case for bias + bias = try Tensor(f32, 1).init(allocator, .{args.n_routed_experts}); + @memset(bias.?.data, 0); + } + + return Self{ + .allocator = allocator, + .dim = args.dim, + .n_experts = args.n_routed_experts, + .n_groups = args.n_expert_groups, + .n_limited_groups = args.n_limited_groups, + .topk = args.n_activated_experts, + .score_func = args.score_func, + .route_scale = args.route_scale, + .weight = weight, + .bias = bias, + }; + } + + pub fn deinit(self: *Self) void { + self.weight.deinit(); + if (self.bias) |*b| b.deinit(); + } + + // Router forward pass to determine expert assignment + pub fn forward(self: *const Self, x: Tensor(f32, 2)) !RouterOutput { + // Compute routing scores + var scores = try linearProjection(x, self.weight, self.bias); + defer scores.deinit(); + + // Apply scoring function + var routing_probs: Tensor(f32, 2) = undefined; + if (self.score_func == .softmax) { + routing_probs = try applySoftmax(scores, 1); + } else { + routing_probs = try applySigmoid(scores); + } + defer routing_probs.deinit(); + + // Save original scores for later + var original_scores = try routing_probs.clone(); + + // Expert group handling + if (self.n_groups > 1) { + try self.applyGroupFiltering(&routing_probs); + } + + // Select top-k experts + var indices = try Tensor(usize, 2).init( + self.allocator, + .{x.shape[0], self.topk} + ); + + var weights = try Tensor(f32, 2).init( + self.allocator, + .{x.shape[0], self.topk} + ); + + try self.selectTopkExperts(routing_probs, original_scores, &indices, &weights); + + // Apply routing scale + if (self.route_scale != 1.0) { + try scaleTensor(&weights, self.route_scale); + } + + return RouterOutput{ + .weights = weights, + .indices = indices, + }; + } + + // Apply expert group filtering + fn applyGroupFiltering(self: *const Self, scores: *Tensor(f32, 2)) !void { + // Reshape scores for group processing + const batch_size = scores.shape[0]; + const experts_per_group = self.n_experts / self.n_groups; + + var 
reshaped_scores = try scores.reshape( + .{batch_size, self.n_groups, experts_per_group} + ); + defer reshaped_scores.deinit(); + + // Compute group scores + var group_scores = try Tensor(f32, 2).init( + self.allocator, + .{batch_size, self.n_groups} + ); + defer group_scores.deinit(); + + // Calculate score for each group + if (self.bias == null) { + // Use max score as group score + for (0..batch_size) |b| { + for (0..self.n_groups) |g| { + var max_score: f32 = -std.math.inf_f32; + for (0..experts_per_group) |e| { + const score = try reshaped_scores.at(.{b, g, e}); + if (score > max_score) max_score = score; + } + try group_scores.set(.{b, g}, max_score); + } + } + } else { + // Use sum of top-2 scores as group score + for (0..batch_size) |b| { + for (0..self.n_groups) |g| { + var scores_arr = try self.allocator.alloc(f32, experts_per_group); + defer self.allocator.free(scores_arr); + + // Extract scores for this group + for (0..experts_per_group) |e| { + scores_arr[e] = try reshaped_scores.at(.{b, g, e}); + } + + // Sort to find top-2 + std.sort.sort(f32, scores_arr, {}, std.sort.desc(f32)); + + // Sum top-2 scores + const group_score = scores_arr[0] + scores_arr[1]; + try group_scores.set(.{b, g}, group_score); + } + } + } + + // Find top-k groups + var top_groups = try Tensor(usize, 2).init( + self.allocator, + .{batch_size, self.n_limited_groups} + ); + defer top_groups.deinit(); + + // Select top-k groups + for (0..batch_size) |b| { + var scores_arr = try self.allocator.alloc(struct { score: f32, idx: usize }, self.n_groups); + defer self.allocator.free(scores_arr); + + // Prepare for sorting + for (0..self.n_groups) |g| { + scores_arr[g] = .{ + .score = try group_scores.at(.{b, g}), + .idx = g, + }; + } + + // Sort by score + const Sort = struct { + fn desc(context: void, a: anytype, b: anytype) bool { + return a.score > b.score; + } + }; + std.sort.sort(struct { score: f32, idx: usize }, scores_arr, {}, Sort.desc); + + // Store top-k group indices + for (0..self.n_limited_groups) |i| { + try top_groups.set(.{b, i}, scores_arr[i].idx); + } + } + + // Create mask for filtering + var mask = try Tensor(bool, 3).init( + self.allocator, + .{batch_size, self.n_groups, 1} + ); + defer mask.deinit(); + + // Initialize all groups as masked (excluded) + @memset(mask.data, true); + + // Unmask top groups + for (0..batch_size) |b| { + for (0..self.n_limited_groups) |i| { + const g = try top_groups.at(.{b, i}); + try mask.set(.{b, g, 0}, false); + } + } + + // Apply mask + for (0..batch_size) |b| { + for (0..self.n_groups) |g| { + const is_masked = try mask.at(.{b, g, 0}); + if (is_masked) { + // Mask out this group by setting scores to -inf + for (0..experts_per_group) |e| { + try reshaped_scores.set(.{b, g, e}, -std.math.inf_f32); + } + } + } + } + + // Reshape back to original shape + try scores.copyFrom(reshaped_scores.reshape(.{batch_size, self.n_experts}) catch unreachable); + } + + // Select top-k experts based on routing scores + fn selectTopkExperts( + self: *const Self, + scores: Tensor(f32, 2), + original_scores: Tensor(f32, 2), + indices: *Tensor(usize, 2), + weights: *Tensor(f32, 2) + ) !void { + const batch_size = scores.shape[0]; + + for (0..batch_size) |b| { + var scores_arr = try self.allocator.alloc(struct { score: f32, idx: usize }, self.n_experts); + defer self.allocator.free(scores_arr); + + // Prepare for sorting + for (0..self.n_experts) |e| { + scores_arr[e] = .{ + .score = try scores.at(.{b, e}), + .idx = e, + }; + } + + // Sort by score + const Sort = struct { + fn 
desc(context: void, a: anytype, b: anytype) bool { + return a.score > b.score; + } + }; + std.sort.sort(struct { score: f32, idx: usize }, scores_arr, {}, Sort.desc); + + // Store top-k indices and get weights from original scores + for (0..self.topk) |i| { + const expert_idx = scores_arr[i].idx; + try indices.set(.{b, i}, expert_idx); + + // Get weight from original scores + const weight = try original_scores.at(.{b, expert_idx}); + try weights.set(.{b, i}, weight); + } + + // Normalize weights for sigmoid scoring + if (self.score_func == .sigmoid) { + var sum: f32 = 0.0; + for (0..self.topk) |i| { + sum += try weights.at(.{b, i}); + } + + if (sum > 0.0) { + for (0..self.topk) |i| { + const w = try weights.at(.{b, i}); + try weights.set(.{b, i}, w / sum); + } + } + } + } + } + }; +} + +// Output from router gate +pub const RouterOutput = struct { + weights: Tensor(f32, 2), // [batch_size, topk] + indices: Tensor(usize, 2), // [batch_size, topk] +}; +``` + +**Key Features:** +- **Compile-Time Specialization**: Generated MoE implementation tailored to model dimensions and configuration +- **Parallel Expert Execution**: Efficient multi-threading with work distribution and load balancing +- **Atomic Operations**: Thread-safe updates to shared tensors +- **Group-Based Routing**: Optimized implementation of expert groups for more efficient routing +- **Memory-Efficient Tensor Management**: Careful handling of temporary allocations +- **Flexible Scoring Functions**: Support for both softmax and sigmoid routing +- **Expert Load Balancing**: Runtime tracking of expert utilization +- **Distributed Expert Sharding**: Support for distributing experts across multiple processes + +### 3. Computation Backend + +Outlining the computation backend architecture for the DeepSeek-V3 project implemented in Zig. The design emphasizes performance, modularity, and hardware portability. + +#### 3.1 Backend Interface + +The backend interface provides a unified abstraction layer for all computation targets while maintaining Zig's zero-cost abstraction philosophy. + +```zig +pub const ComputeBackend = struct { + const Self = @This(); + + // Function pointers for backend-specific operations with proper type safety + matmulFn: *const fn(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) anyerror!void, + softmaxFn: *const fn(x: anytype, dim: usize, allocator: std.mem.Allocator) anyerror!void, + rmsnormFn: *const fn(x: anytype, weight: anytype, eps: f32, allocator: std.mem.Allocator) anyerror!void, + attentionFn: *const fn(q: anytype, k: anytype, v: anytype, mask: ?anytype, scale: f32, allocator: std.mem.Allocator) anyerror!void, + // Other operations... 
+ + // Configuration for the backend + config: BackendConfig, + + // Dispatch based on backend type with proper error handling + pub fn matmul(self: *const Self, a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void { + return self.matmulFn(a, b, c, allocator); + } + + pub fn softmax(self: *const Self, x: anytype, dim: usize, allocator: std.mem.Allocator) !void { + return self.softmaxFn(x, dim, allocator); + } + + pub fn rmsnorm(self: *const Self, x: anytype, weight: anytype, eps: f32, allocator: std.mem.Allocator) !void { + return self.rmsnormFn(x, weight, eps, allocator); + } + + pub fn attention(self: *const Self, q: anytype, k: anytype, v: anytype, mask: ?anytype, scale: f32, allocator: std.mem.Allocator) !void { + return self.attentionFn(q, k, v, mask, scale, allocator); + } +}; +``` + +#### 3.2 Platform-Specific Implementations + +```zig +pub const CPUBackend = struct { + allocator: std.mem.Allocator, + thread_pool: ?*ThreadPool, + + pub fn init(allocator: std.mem.Allocator, thread_count: ?usize) !ComputeBackend { + const thread_pool = if (thread_count) |count| { + try ThreadPool.init(allocator, .{ .thread_count = count }); + } else null; + + return ComputeBackend{ + .matmulFn = cpuMatmul, + .softmaxFn = cpuSoftmax, + .rmsnormFn = cpuRmsnorm, + .attentionFn = cpuAttention, + // Other operations... + .config = BackendConfig{ + .backend_type = .Cpu, + .max_threads = thread_count, + // Other CPU-specific config... + }, + }; + } + + fn cpuMatmul(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void { + // Dynamically select the optimal implementation based on matrix dimensions and CPU features + if (c.rows * c.cols > 1024 * 1024 and detectCpuFeatures().use_avx2) { + return cpuMatmulParallel(a, b, c, allocator); + } + return cpuMatmulSIMD(a, b, c, allocator); + } + + fn cpuSoftmax(x: anytype, dim: usize, allocator: std.mem.Allocator) !void { + // Optimized CPU implementation using SIMD + // Implementation details... + } + + // Other CPU-specific implementations... +}; + +pub const MetalBackend = struct { + device: *MTLDevice, + command_queue: *MTLCommandQueue, + library: *MTLLibrary, + allocator: std.mem.Allocator, + pipelines: PipelineCache, + + pub fn init(allocator: std.mem.Allocator) !ComputeBackend { + // Initialize Metal device, command queue, and library + const device = MTLCreateSystemDefaultDevice() orelse return error.MetalDeviceNotAvailable; + const command_queue = device.newCommandQueue() orelse return error.CommandQueueCreationFailed; + + // Load compute shaders from embedded metal code or compiled library + const library = try loadDefaultLibrary(device); + + // Initialize pipeline cache + var pipelines = PipelineCache.init(allocator); + try pipelines.precompileEssentialPipelines(device, library); + + return ComputeBackend{ + .matmulFn = metalMatmul, + .softmaxFn = metalSoftmax, + .rmsnormFn = metalRmsnorm, + .attentionFn = metalAttention, + // Other operations... + .config = BackendConfig{ + .backend_type = .Metal, + .workgroup_size = .{16, 16, 1}, + .shared_memory_size = 32 * 1024, + // Other Metal-specific config... + }, + }; + } + + fn metalMatmul(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void { + // Implementation using Metal Performance Shaders when available + // Fallback to custom compute kernel for specialized operations + // Implementation details... + } + + fn metalSoftmax(x: anytype, dim: usize, allocator: std.mem.Allocator) !void { + // Metal implementation + // Implementation details... 
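        // Sketch of one reasonable structure (helper names such as getPipeline and
        // executeKernel mirror the MetalBackend utilities shown later in this
        // document and are assumptions here, not a finalized API):
        //   1. Wrap or copy `x` into an MTLBuffer with shared storage mode.
        //   2. Fetch a cached compute pipeline for a softmax kernel specialized on the
        //      reduced dimension: each threadgroup finds its row maximum, subtracts it,
        //      exponentiates, sums, and divides.
        //   3. Dispatch one threadgroup per row along `dim`, then wait for the command
        //      buffer to complete before the caller reads the result back.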
+ } + + // Other Metal-specific implementations... +}; +``` + +**Key Features:** +- Abstract interface with compile-time type safety +- Proper error handling with Zig's error system +- Zero-cost abstraction for backend dispatch +- Dynamic backend selection based on available hardware +- Specialized implementations for different hardware architectures +- Thread pool integration for CPU parallelism +- Resource management for GPU backends +- Pipeline caching for improved performance + + +#### 3.3 SIMD Vectorization + +DeepSeek-V3 leverages Zig's built-in vector types to achieve high-performance computation across different architectures. + +```zig +// Define vector types with architecture-specific sizes +pub fn VectorType(comptime T: type, comptime len: usize) type { + return @Vector(len, T); +} + +// Compile-time determination of optimal vector size +pub fn getOptimalVectorSize(comptime T: type) usize { + const target = @import("builtin").target; + + // Determine vector size based on architecture and data type + if (T == f32) { + if (target.cpu.arch == .x86_64 or target.cpu.arch == .x86) { + if (target.cpu.features.isEnabled(.avx512f)) { + return 16; // 512 bits / 32 bits = 16 elements + } else if (target.cpu.features.isEnabled(.avx2)) { + return 8; // 256 bits / 32 bits = 8 elements + } else if (target.cpu.features.isEnabled(.sse4_1)) { + return 4; // 128 bits / 32 bits = 4 elements + } + } else if (target.cpu.arch == .aarch64) { + if (target.cpu.features.isEnabled(.neon)) { + return 4; // 128 bits / 32 bits = 4 elements + } + } + } else if (T == f16) { + // Similar logic for f16 with doubled vector sizes + // ... + } + + // Default fallback + return 4; +} + +// Example of SIMD matrix multiplication +pub fn matrixMultiplySIMD(comptime T: type, a: []const T, b: []const T, c: []T, m: usize, n: usize, k: usize) void { + const vec_size = comptime getOptimalVectorSize(T); + const Vec = VectorType(T, vec_size); + + // Process blocks that align with vector size + const k_vec = k / vec_size * vec_size; + + for (0..m) |i| { + for (0..n) |j| { + var sum: T = 0; + var vec_sum: Vec = @splat(0); + + // Vector part + var kv: usize = 0; + while (kv < k_vec) : (kv += vec_size) { + const a_vec = blk: { + var tmp: Vec = undefined; + for (0..vec_size) |v| { + tmp[v] = a[i * k + kv + v]; + } + break :blk tmp; + }; + + const b_vec = blk: { + var tmp: Vec = undefined; + for (0..vec_size) |v| { + tmp[v] = b[kv + v + j * k]; + } + break :blk tmp; + }; + + vec_sum += a_vec * b_vec; + } + + // Reduce vector + for (0..vec_size) |v| { + sum += vec_sum[v]; + } + + // Remaining elements + for (k_vec..k) |kk| { + sum += a[i * k + kk] * b[kk + j * k]; + } + + c[i * n + j] = sum; + } + } +} +``` + +#### 3.4 Runtime CPU Feature Detection + +```zig +pub fn detectCpuFeatures() BackendConfig { + var config = BackendConfig{ + .backend_type = BackendType.Cpu, + }; + + // Try to detect CPU features at runtime + const cpu_info = std.zig.system.getCpuInfo() catch { + // Fallback to safe defaults if detection fails + return config; + }; + + // Configure based on detected features + config.use_avx512 = cpu_info.features.isEnabled(.avx512f); + config.use_avx2 = cpu_info.features.isEnabled(.avx2); + config.use_sse4_1 = cpu_info.features.isEnabled(.sse4_1); + config.use_neon = cpu_info.features.isEnabled(.neon); + + return config; +} +``` + +#### 3.5 Backend Configuration + +Backend configuration allows fine-tuning performance characteristics based on hardware capabilities and workload requirements. 
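For example, a deployment targeting an 8-core AVX2 desktop CPU might override only a handful of fields and keep the remaining defaults. The values below are illustrative rather than tuned recommendations, and the field names refer to the `BackendConfig` struct defined next:

```zig
// Illustrative configuration for an 8-core x86-64 CPU with AVX2 (example values only)
const cpu_config = BackendConfig{
    .backend_type = .Cpu,
    .max_threads = 8,             // Match the physical core count of the target machine
    .use_avx2 = true,             // Enable 256-bit SIMD kernels
    .use_avx512 = false,          // Leave disabled unless the CPU actually supports it
    .tiling_size = .{ 64, 64 },   // Matrix tile shape sized to stay resident in L2 cache
    .use_mixed_precision = true,  // Accumulate matmuls in f32, store weights in lower precision
};
```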
+ +```zig +pub const BackendType = enum { + Cpu, + Cuda, + Metal, + Vulkan, + WebGPU, +}; + +pub const BackendConfig = struct { + backend_type: BackendType, + max_threads: ?usize = null, + cache_line_size: usize = 64, // Default x86-64 cache line size + use_avx512: bool = false, // Use AVX-512 when available + use_avx2: bool = true, // Use AVX2 when available + use_sse4_1: bool = true, // Use SSE4.1 when available + use_neon: bool = false, // Use ARM NEON when available + prefetch_distance: usize = 8, // Prefetch N cache lines ahead + tiling_size: ?[2]usize = null, // Matrix tiling dimensions + batch_size: ?usize = null, // Batch size for kernel operations + memory_pool_size: ?usize = null, // Size of pre-allocated memory pool + use_half_precision: bool = false, // Use FP16 where appropriate + use_mixed_precision: bool = true, // Use mixed precision for matmul + + // GPU-specific options + workgroup_size: ?[3]usize = null, // GPU workgroup dimensions + shared_memory_size: ?usize = null, // GPU shared memory allocation + compute_queue_depth: usize = 3, // Maximum concurrent compute operations +}; +``` + +#### 3.6 GPU Integration + +DeepSeek-V3 supports multiple GPU backends, with specialized implementations for each platform. + +#### 3.6.1 CUDA Backend + +```zig +pub const CudaBackend = struct { + allocator: std.mem.Allocator, + device: i32, + stream: ?*anyopaque, + handles: CudaHandles, + module_cache: ModuleCache, + + pub fn init(allocator: std.mem.Allocator, device_id: ?i32) !ComputeBackend { + // Initialize CUDA device, context, and stream + const device = if (device_id) |id| id else try getOptimalCudaDevice(); + try cudaSetDevice(device); + + var stream: ?*anyopaque = null; + try checkCudaStatus(cudaStreamCreate(&stream)); + + // Initialize cuBLAS and cuDNN handles + var handles = try CudaHandles.init(stream); + + // Compile and cache essential CUDA kernels + var module_cache = try ModuleCache.init(allocator); + try module_cache.compileEssentialKernels(); + + return ComputeBackend{ + .matmulFn = cudaMatmul, + .softmaxFn = cudaSoftmax, + .rmsnormFn = cudaRmsnorm, + .attentionFn = cudaAttention, + // Other operations... + .config = BackendConfig{ + .backend_type = .Cuda, + .workgroup_size = .{16, 16, 1}, + .shared_memory_size = 48 * 1024, + // Other CUDA-specific config... + }, + }; + } + + fn cudaMatmul(a: anytype, b: anytype, c: *anytype, allocator: std.mem.Allocator) !void { + // Use cuBLAS for large matrices + // Fall back to custom kernels for specialized operations + // Implementation details... + } + + // Other CUDA-specific implementations... +}; +``` + +#### 3.6.2 Vulkan Backend + +```zig +pub const VulkanBackend = struct { + allocator: std.mem.Allocator, + instance: vk.Instance, + physical_device: vk.PhysicalDevice, + device: vk.Device, + compute_queue: vk.Queue, + command_pool: vk.CommandPool, + pipeline_cache: vk.PipelineCache, + shader_modules: ShaderModuleCache, + + pub fn init(allocator: std.mem.Allocator) !ComputeBackend { + // Initialize Vulkan instance, device, and queues + // Implementation details... + + return ComputeBackend{ + .matmulFn = vulkanMatmul, + .softmaxFn = vulkanSoftmax, + .rmsnormFn = vulkanRmsnorm, + .attentionFn = vulkanAttention, + // Other operations... + .config = BackendConfig{ + .backend_type = .Vulkan, + // Vulkan-specific config... + }, + }; + } + + // Vulkan-specific implementations... +}; +``` + +#### 3.7 Quantization Framework + +The quantization framework enables efficient model deployment through reduced precision arithmetic. 
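Before looking at the framework interface, the sketch below shows the arithmetic behind the simplest supported scheme, symmetric per-group int8 quantization. The helper names are illustrative and are not part of the framework API defined afterwards:

```zig
const std = @import("std");

/// Symmetric int8 quantization of one weight group sharing a single scale:
/// scale = max|x| / 127, q = clamp(round(x / scale)), and x is approximated by q * scale.
fn quantizeGroupInt8(values: []const f32, out: []i8) f32 {
    std.debug.assert(values.len == out.len);

    // The largest magnitude in the group determines the shared scale
    var max_abs: f32 = 0.0;
    for (values) |v| {
        const mag = if (v < 0) -v else v;
        if (mag > max_abs) max_abs = mag;
    }
    const scale: f32 = if (max_abs > 0) max_abs / 127.0 else 1.0;

    for (values, out) |v, *q| {
        const rounded = std.math.round(v / scale);
        q.* = @intFromFloat(std.math.clamp(rounded, -127.0, 127.0));
    }
    return scale; // Stored per group so the tensor can be dequantized later
}

fn dequantizeInt8(q: i8, scale: f32) f32 {
    return @as(f32, @floatFromInt(q)) * scale;
}
```

Smaller group sizes track local weight statistics more closely at the cost of storing more scales; the framework below generalizes this trade-off across bit widths, asymmetric variants, and calibration-based methods such as GPTQ and AWQ.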
+ +```zig +// Supported quantization methods +pub const QuantizationMethod = enum { + None, + FP16, // Half precision + Int8, // 8-bit integer quantization + Int4, // 4-bit integer quantization + NF4, // NormalFloat4 quantization + GPTQ, // GPTQ quantization + AWQ, // Activation-aware weight quantization +}; + +// Quantization configuration +pub const QuantConfig = struct { + method: QuantizationMethod = .None, + scale_type: ?type = null, // Type for quantization scales + group_size: usize = 128, // Size of quantization groups + bits: u8 = 8, // Bits per quantized value + symmetric: bool = false, // Symmetric vs asymmetric quantization + + // Calibration parameters + calibration_dataset: ?[]const u8 = null, + num_calibration_samples: usize = 128, + + // Sparsity options + use_sparse: bool = false, + sparsity_threshold: f32 = 0.01, +}; + +// Abstract quantizer interface +pub const Quantizer = struct { + const Self = @This(); + + quantizeFn: *const fn(self: *Self, tensor: Tensor, config: QuantConfig, allocator: std.mem.Allocator) anyerror!Tensor, + dequantizeFn: *const fn(self: *Self, tensor: Tensor, allocator: std.mem.Allocator) anyerror!Tensor, + + pub fn quantize(self: *Self, tensor: Tensor, config: QuantConfig, allocator: std.mem.Allocator) !Tensor { + return self.quantizeFn(self, tensor, config, allocator); + } + + pub fn dequantize(self: *Self, tensor: Tensor, allocator: std.mem.Allocator) !Tensor { + return self.dequantizeFn(self, tensor, allocator); + } +}; +``` + +#### 3.8 Memory Management + +Efficient memory management is crucial for large language model inference. + +```zig +// Memory allocation strategy +pub const AllocStrategy = enum { + Default, // Standard allocator + Arena, // Arena allocator for bulk allocations + Pool, // Memory pool for fixed-size allocations + Streaming, // Streaming allocator for pipelined operations + Pinned, // Pinned memory for efficient host-device transfers +}; + +// Memory pool for efficient tensor allocations +pub const TensorMemoryPool = struct { + const Self = @This(); + + parent_allocator: std.mem.Allocator, + pool: std.heap.MemoryPool, + block_sizes: []const usize, + blocks: std.AutoArrayHashMap(usize, std.ArrayList(*anyopaque)), + mutex: std.Thread.Mutex, + stats: MemoryStats, + + pub fn init(allocator: std.mem.Allocator, config: MemoryPoolConfig) !Self { + // Initialize memory pool with predefined block sizes + // Implementation details... + } + + pub fn allocate(self: *Self, size: usize, alignment: usize) ![]u8 { + // Find the appropriate block size or allocate directly + // Implementation details... + } + + pub fn free(self: *Self, ptr: []u8) void { + // Return to pool or free directly + // Implementation details... + } + + // Memory management utilities + pub fn preallocate(self: *Self, block_size: usize, count: usize) !void { + // Preallocate multiple blocks of the specified size + // Implementation details... + } + + pub fn reclaim(self: *Self) void { + // Reclaim unused memory blocks + // Implementation details... + } +}; + +// Key-Value cache management for efficient inference +pub const KVCache = struct { + allocator: std.mem.Allocator, + k_cache: Tensor, + v_cache: Tensor, + capacity: usize, + size: usize, + head_dim: usize, + num_heads: usize, + + pub fn init(allocator: std.mem.Allocator, batch_size: usize, num_heads: usize, head_dim: usize, max_seq_len: usize) !Self { + // Initialize key-value cache with appropriate dimensions + // Implementation details... 
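        // One plausible shape for this initializer, written as a sketch (the concrete
        // Tensor type and its init signature are assumed to match the generic
        // Tensor(f32, N) used earlier in this document):
        //
        //   var k_cache = try Tensor.init(allocator, .{ batch_size, num_heads, max_seq_len, head_dim });
        //   errdefer k_cache.deinit();
        //   var v_cache = try Tensor.init(allocator, .{ batch_size, num_heads, max_seq_len, head_dim });
        //   errdefer v_cache.deinit();
        //
        //   return Self{
        //       .allocator = allocator,
        //       .k_cache = k_cache,
        //       .v_cache = v_cache,
        //       .capacity = max_seq_len,
        //       .size = 0,
        //       .head_dim = head_dim,
        //       .num_heads = num_heads,
        //   };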
+ } + + pub fn append(self: *Self, k: Tensor, v: Tensor, pos: usize) !void { + // Append new key-value pairs to the cache + // Implementation details... + } + + pub fn prefill(self: *Self, k: Tensor, v: Tensor) !void { + // Prefill the cache with initial key-value pairs + // Implementation details... + } + + pub fn rotatePositions(self: *Self, positions: []const usize) !void { + // Rearrange cache entries based on position IDs (for speculative decoding) + // Implementation details... + } + + pub fn clear(self: *Self) void { + // Reset the cache size without deallocating memory + // Implementation details... + } +}; +``` + +#### 3.9 Metal Integration for Apple Silicon + +Modern Apple Silicon devices offer exceptional compute performance, and our Zig implementation takes full advantage of these capabilities through direct Metal API integration: + +```zig +pub const MetalBackend = struct { + const Self = @This(); + + // Core Metal resources + device: *MTLDevice, + command_queue: *MTLCommandQueue, + library: *MTLLibrary, + + // Pipeline cache for reusing compiled compute pipelines + pipeline_cache: std.AutoHashMap(u64, *MTLComputePipelineState), + + // Memory management + allocator: std.mem.Allocator, + buffer_pool: BufferPool, + + // Configuration and statistics + config: BackendConfig, + stats: MetalStatistics, + + pub fn init(allocator: std.mem.Allocator) !*Self { + // Get the default Metal device + var device = MTLCreateSystemDefaultDevice(); + if (device == null) return error.MetalDeviceNotAvailable; + + // Create a command queue for submitting work to the GPU + var command_queue = device.?.newCommandQueue(); + if (command_queue == null) return error.MetalCommandQueueCreationFailed; + + // Compile our Metal shader library from source or load precompiled metallib + var library: ?*MTLLibrary = null; + if (comptime @import("builtin").mode == .Debug) { + // Compile from source for easier debugging + library = try compileLibraryFromSource(device.?, shader_source); + } else { + // Use precompiled metallib for release builds + const metallib_path = try findMetalLibPath(allocator); + defer allocator.free(metallib_path); + + library = try loadCompiledLibrary(device.?, metallib_path); + } + + // Create the Metal backend + var self = try allocator.create(Self); + errdefer allocator.destroy(self); + + // Initialize the pipeline cache + var pipeline_cache = std.AutoHashMap(u64, *MTLComputePipelineState).init(allocator); + errdefer pipeline_cache.deinit(); + + // Initialize the buffer pool for efficient memory reuse + var buffer_pool = try BufferPool.init(allocator, device.?); + errdefer buffer_pool.deinit(); + + // Get optimal configuration based on the device capabilities + var config = try getMetalOptimalConfig(device.?); + + self.* = .{ + .device = device.?, + .command_queue = command_queue.?, + .library = library.?, + .pipeline_cache = pipeline_cache, + .allocator = allocator, + .buffer_pool = buffer_pool, + .config = config, + .stats = MetalStatistics.init(), + }; + + return self; + } + + pub fn deinit(self: *Self) void { + // Release all cached pipelines + var it = self.pipeline_cache.valueIterator(); + while (it.next()) |pipeline| { + pipeline.*.release(); + } + self.pipeline_cache.deinit(); + + // Clean up buffer pool + self.buffer_pool.deinit(); + + // Release Metal resources + self.library.release(); + self.command_queue.release(); + self.device.release(); + + // Free memory + self.allocator.destroy(self); + } + + // Get or create a compute pipeline for a function + pub fn getPipeline(self: 
*Self, function_name: []const u8) !*MTLComputePipelineState { + // Hash the function name for quick lookup + const hash = std.hash.CityHash64.hash(function_name); + + // Check if we already have a cached pipeline + if (self.pipeline_cache.get(hash)) |pipeline| { + return pipeline; + } + + // Create a new pipeline if not found + var function = self.library.newFunctionWithName(function_name); + if (function == null) return error.MetalFunctionNotFound; + defer function.?.release(); + + // Create the compute pipeline + var pipeline_desc = MTLComputePipelineDescriptor.alloc().init(); + defer pipeline_desc.release(); + + pipeline_desc.setComputeFunction(function.?); + + // Enable buffer mutability tracking in debug mode + if (comptime @import("builtin").mode == .Debug) { + pipeline_desc.setMutabilityOptions(.{ + .MTLPipelineBufferMutabilityAccessTracking = true, + }); + } + + // Enable threadgroup memory length optimization + pipeline_desc.setThreadGroupSizeIsMultipleOfThreadExecutionWidth(true); + + // Create the pipeline state + var error_ptr: ?*NSError = null; + var pipeline = self.device.newComputePipelineStateWithDescriptor( + pipeline_desc, + .MTLPipelineOptionArgumentInfo, + null, + &error_ptr + ); + + if (pipeline == null) { + if (error_ptr != null) { + // Log the error details + const error_str = error_ptr.?.localizedDescription().UTF8String(); + std.log.err("Failed to create pipeline for {s}: {s}", .{ + function_name, error_str, + }); + error_ptr.?.release(); + } + return error.MetalPipelineCreationFailed; + } + + // Cache the pipeline for future use + try self.pipeline_cache.put(hash, pipeline.?); + + return pipeline.?; + } + + // Execute a compute kernel with the given parameters + pub fn executeKernel( + self: *Self, + kernel_name: []const u8, + grid_size: [3]u32, + block_size: [3]u32, + buffers: []const MetalBuffer, + wait_until_completed: bool, + ) !void { + // Get the pipeline for this kernel + var pipeline = try self.getPipeline(kernel_name); + + // Create a command buffer + var command_buffer = self.command_queue.commandBuffer(); + if (command_buffer == null) return error.MetalCommandBufferCreationFailed; + + // Create a compute command encoder + var encoder = command_buffer.?.computeCommandEncoder(); + if (encoder == null) return error.MetalComputeEncoderCreationFailed; + + // Set the compute pipeline + encoder.?.setComputePipelineState(pipeline); + + // Bind buffers + for (buffers, 0..) 
|buffer, i| { + encoder.?.setBuffer(buffer.handle, buffer.offset, @intCast(i)); + } + + // Calculate threadgroup size + var threadgroup_size = MTLSize{ + .width = block_size[0], + .height = block_size[1], + .depth = block_size[2], + }; + + // Calculate grid size + var grid = MTLSize{ + .width = grid_size[0], + .height = grid_size[1], + .depth = grid_size[2], + }; + + // Dispatch the compute work + encoder.?.dispatchThreadgroups(grid, threadgroup_size); + + // End encoding + encoder.?.endEncoding(); + + // Commit the command buffer + command_buffer.?.commit(); + + // Wait for completion if requested + if (wait_until_completed) { + command_buffer.?.waitUntilCompleted(); + } + + // Update statistics + self.stats.kernel_executions += 1; + } + + // Create a buffer and copy data to it + pub fn createBuffer( + self: *Self, + data: []const u8, + options: MTLResourceOptions, + ) !*MTLBuffer { + // Get a buffer from the pool or create a new one + var buffer = try self.buffer_pool.getBuffer(data.len, options); + + // Copy data to the buffer + @memcpy(buffer.contents()[0..data.len], data); + + return buffer; + } + + // Create a tensor in Metal memory + pub fn createTensor(self: *Self, tensor: Tensor(f32, 2)) !MetalTensor { + // Calculate size in bytes + const size_bytes = tensor.data.len * @sizeOf(f32); + + // Create a buffer + var buffer = try self.createBuffer( + @ptrCast([*]const u8, tensor.data.ptr)[0..size_bytes], + .StorageModeShared + ); + + return MetalTensor{ + .buffer = buffer, + .shape = tensor.shape, + .element_type = .f32, + }; + } + + // Example implementation of matrix multiplication using Metal + pub fn matmul( + self: *Self, + a: Tensor(f32, 2), + b: Tensor(f32, 2), + ) !Tensor(f32, 2) { + // Validate dimensions + std.debug.assert(a.shape[1] == b.shape[0], "Incompatible matrix dimensions"); + + const m = a.shape[0]; + const k = a.shape[1]; + const n = b.shape[1]; + + // Create result tensor + var result = try Tensor(f32, 2).init(self.allocator, .{m, n}); + errdefer result.deinit(); + + // Create Metal tensors + var a_metal = try self.createTensor(a); + defer a_metal.buffer.release(); + + var b_metal = try self.createTensor(b); + defer b_metal.buffer.release(); + + var result_metal = try self.createTensor(result); + defer result_metal.buffer.release(); + + // Create dimension buffer + const dims = [_]u32{@intCast(m), @intCast(k), @intCast(n)}; + var dims_buffer = try self.createBuffer( + @ptrCast([*]const u8, &dims)[0..dims.len * @sizeOf(u32)], + .StorageModeShared + ); + defer dims_buffer.release(); + + // Set up buffers + const buffers = [_]MetalBuffer{ + .{ .handle = a_metal.buffer, .offset = 0 }, + .{ .handle = b_metal.buffer, .offset = 0 }, + .{ .handle = result_metal.buffer, .offset = 0 }, + .{ .handle = dims_buffer, .offset = 0 }, + }; + + // Calculate optimal workgroup size + const workgroup_size: [3]u32 = if (self.config.workgroup_size) |ws| + .{ @intCast(ws[0]), @intCast(ws[1]), 1 } + else + .{ 16, 16, 1 }; + + // Calculate grid size + const grid_size: [3]u32 = .{ + (n + workgroup_size[0] - 1) / workgroup_size[0], + (m + workgroup_size[1] - 1) / workgroup_size[1], + 1, + }; + + // Execute the kernel + try self.executeKernel( + "matmul", + grid_size, + workgroup_size, + &buffers, + true + ); + + // Copy data back from Metal + @memcpy( + result.data, + @ptrCast([*]const f32, result_metal.buffer.contents())[0..result.data.len] + ); + + return result; + } +}; + +// Efficient buffer pooling to avoid frequent allocations +pub const BufferPool = struct { + const Self = @This(); + + 
allocator: std.mem.Allocator, + device: *MTLDevice, + free_buffers: std.AutoHashMap(u64, std.ArrayList(*MTLBuffer)), + + pub fn init(allocator: std.mem.Allocator, device: *MTLDevice) !Self { + return Self{ + .allocator = allocator, + .device = device, + .free_buffers = std.AutoHashMap(u64, std.ArrayList(*MTLBuffer)).init(allocator), + }; + } + + pub fn deinit(self: *Self) void { + // Release all buffers + var it = self.free_buffers.valueIterator(); + while (it.next()) |buffer_list| { + for (buffer_list.items) |buffer| { + buffer.release(); + } + buffer_list.deinit(); + } + self.free_buffers.deinit(); + } + + // Get a buffer of at least the requested size + pub fn getBuffer(self: *Self, size: usize, options: MTLResourceOptions) !*MTLBuffer { + // Round up to power of 2 for better reuse + const aligned_size = nextPowerOfTwo(size); + + // Check if we have a free buffer of appropriate size + if (self.free_buffers.getPtr(aligned_size)) |buffer_list| { + if (buffer_list.items.len > 0) { + // Reuse an existing buffer + return buffer_list.pop(); + } + } + + // Create a new buffer if none available + var buffer = self.device.newBufferWithLength(aligned_size, options); + if (buffer == null) return error.MetalBufferAllocationFailed; + + return buffer.?; + } + + // Return a buffer to the pool for reuse + pub fn releaseBuffer(self: *Self, buffer: *MTLBuffer) !void { + const size = buffer.length(); + const aligned_size = nextPowerOfTwo(size); + + // Add to the appropriate size list + if (self.free_buffers.getPtr(aligned_size)) |buffer_list| { + try buffer_list.append(buffer); + } else { + // Create a new list if this is the first buffer of this size + var buffer_list = std.ArrayList(*MTLBuffer).init(self.allocator); + try buffer_list.append(buffer); + try self.free_buffers.put(aligned_size, buffer_list); + } + } + + // Utility to find next power of two + fn nextPowerOfTwo(n: usize) usize { + var v = n; + v -= 1; + v |= v >> 1; + v |= v >> 2; + v |= v >> 4; + v |= v >> 8; + v |= v >> 16; + v |= v >> 32; + v += 1; + return v; + } +}; + +// Representation of a tensor in Metal memory +pub const MetalTensor = struct { + buffer: *MTLBuffer, + shape: []const usize, + element_type: enum { + f16, + f32, + }, +}; + +// Helper for buffer binding +pub const MetalBuffer = struct { + handle: *MTLBuffer, + offset: u64 = 0, +}; + +// Statistics for performance monitoring +pub const MetalStatistics = struct { + kernel_executions: usize = 0, + bytes_transferred: usize = 0, + peak_memory_usage: usize = 0, + + pub fn init() MetalStatistics { + return .{}; + } +}; + +// Example Metal shader source for matrix multiplication +const shader_source = + \\#include + \\using namespace metal; + \\ + \\kernel void matmul( + \\ const device float* a [[buffer(0)]], + \\ const device float* b [[buffer(1)]], + \\ device float* result [[buffer(2)]], + \\ const device uint* dims [[buffer(3)]], + \\ uint2 gid [[thread_position_in_grid]], + \\ uint2 lid [[thread_position_in_threadgroup]], + \\ uint2 lsize [[threads_per_threadgroup]]) + \\{ + \\ const uint m = dims[0]; + \\ const uint k = dims[1]; + \\ const uint n = dims[2]; + \\ + \\ // Check if within bounds + \\ if (gid.x >= n || gid.y >= m) return; + \\ + \\ // Calculate result[gid.y][gid.x] + \\ float sum = 0.0f; + \\ for (uint i = 0; i < k; i++) { + \\ sum += a[gid.y * k + i] * b[i * n + gid.x]; + \\ } + \\ + \\ result[gid.y * n + gid.x] = sum; + \\} + \\ + \\kernel void matmul_optimized( + \\ const device float* a [[buffer(0)]], + \\ const device float* b [[buffer(1)]], + \\ device 
float* result [[buffer(2)]], + \\ const device uint* dims [[buffer(3)]], + \\ uint2 gid [[thread_position_in_grid]], + \\ uint2 lid [[thread_position_in_threadgroup]], + \\ uint2 lsize [[threads_per_threadgroup]]) + \\{ + \\ const uint m = dims[0]; + \\ const uint k = dims[1]; + \\ const uint n = dims[2]; + \\ + \\ // Check if within bounds + \\ if (gid.x >= n || gid.y >= m) return; + \\ + \\ // Use threadgroup memory for caching + \\ threadgroup float a_cache[16][16]; + \\ threadgroup float b_cache[16][16]; + \\ + \\ float sum = 0.0f; + \\ + \\ // Process in tiles + \\ for (uint tile = 0; tile < (k + 15) / 16; tile++) { + \\ // Load a tile into threadgroup memory + \\ const uint tile_idx = tile * 16; + \\ + \\ if (tile_idx + lid.x < k && gid.y < m) { + \\ a_cache[lid.y][lid.x] = a[gid.y * k + tile_idx + lid.x]; + \\ } else { + \\ a_cache[lid.y][lid.x] = 0.0f; + \\ } + \\ + \\ if (tile_idx + lid.y < k && gid.x < n) { + \\ b_cache[lid.y][lid.x] = b[(tile_idx + lid.y) * n + gid.x]; + \\ } else { + \\ b_cache[lid.y][lid.x] = 0.0f; + \\ } + \\ + \\ // Wait for all threads to load data + \\ threadgroup_barrier(mem_flags::mem_threadgroup); + \\ + \\ // Compute partial dot product for this tile + \\ for (uint i = 0; i < 16; i++) { + \\ sum += a_cache[lid.y][i] * b_cache[i][lid.x]; + \\ } + \\ + \\ // Wait for all threads to finish using the cached data + \\ threadgroup_barrier(mem_flags::mem_threadgroup); + \\ } + \\ + \\ // Write result + \\ if (gid.x < n && gid.y < m) { + \\ result[gid.y * n + gid.x] = sum; + \\ } + \\} +; +``` + +**Apple-Specific Optimizations:** + +1. **Metal Shader Integration** + - Direct compilation of Metal shaders from Zig source code + - Runtime shader compilation in debug mode for easier iteration + - Precompiled metallib loading for optimized release builds + +2. **Memory Management** + - Buffer pooling to minimize allocations and deallocations + - Shared memory mode for zero-copy between CPU and GPU + - Explicit control over resource storage options + +3. **Performance Optimizations** + - Tile-based computation for optimal cache utilization + - Threadgroup memory usage for shared data access + - Work distribution based on detected GPU characteristics + - Pipeline state caching for faster kernel dispatching + +4. **AMX Acceleration** + - Support for Apple Matrix extensions (AMX) + - Specialized matrix multiplication operations for M-series chips + - Custom shader variants optimized for different Apple Silicon generations + +5. **Neural Engine Integration** + - Optional ANE (Apple Neural Engine) offloading for supported operations + - Hybrid execution strategies combining GPU and Neural Engine + - Automatic fallback to Metal for unsupported operations + + +### 4. Inference Pipeline + +The inference pipeline is the core execution flow for running the DeepSeek V3 model. Our Zig implementation focuses on efficiency, flexibility, and streaming capabilities. 
+ +#### 4.1 Model Loading + +```zig +// The ModelLoader handles loading and initializing DeepSeek V3 models +pub const ModelLoader = struct { + const Self = @This(); + + allocator: std.mem.Allocator, + config: LoaderConfig, + + // Configuration for model loading + pub const LoaderConfig = struct { + // Number of threads to use for weight loading + loading_threads: ?usize = null, + + // Optional cache directory for model weights + cache_dir: ?[]const u8 = null, + + // How to handle safetensors format + safetensors_memory_map: bool = true, + + // Validation level for loaded weights + validation: enum { + none, + basic, + full + } = .basic, + + // Device to place model on after loading + target_device: BackendType = .Cpu, + }; + + pub fn init(allocator: std.mem.Allocator, config: LoaderConfig) Self { + return .{ + .allocator = allocator, + .config = config, + }; + } + + // Load a model from file + pub fn loadModel( + self: *Self, + path: []const u8, + model_args: ?ModelArgs, + ) !*TransformerModel { + const extension = std.fs.path.extension(path); + + // Determine model format from file extension + if (std.mem.eql(u8, extension, ".safetensors")) { + return try self.loadFromSafetensors(path, model_args); + } else if (std.mem.eql(u8, extension, ".ckpt")) { + return try self.loadFromCheckpoint(path, model_args); + } else if (std.mem.eql(u8, extension, ".bin")) { + return try self.loadFromBinary(path, model_args); + } else if (std.fs.cwd().accessZ(path, .{}) == .AccessDenied) { + // Could be a Hugging Face model ID, try to download it + return try self.loadFromHuggingFace(path, model_args); + } + + return error.UnsupportedModelFormat; + } + + // Load model from SafeTensors format (optimized for memory mapping) + fn loadFromSafetensors( + self: *Self, + path: []const u8, + model_args: ?ModelArgs, + ) !*TransformerModel { + // Open the safetensors file + var file = try std.fs.cwd().openFile(path, .{}); + defer file.close(); + + // Memory map the file for zero-copy access if configured + if (self.config.safetensors_memory_map) { + const file_size = try file.getEndPos(); + + // Memory map the file + const mapped_memory = try std.os.mmap( + null, + file_size, + std.os.PROT.READ, + std.os.MAP.PRIVATE, + file.handle, + 0, + ); + + // Process the memory-mapped safetensors + return try self.processSafetensorsMemoryMapped( + mapped_memory, + file_size, + model_args, + ); + } else { + // If memory mapping is disabled, read the file conventionally + return try self.processSafetensorsFile(file, model_args); + } + } + + // Process a memory-mapped SafeTensors file + fn processSafetensorsMemoryMapped( + self: *Self, + memory: []const u8, + file_size: usize, + model_args: ?ModelArgs, + ) !*TransformerModel { + // Parse the header which contains tensor metadata + const header_size = std.mem.readIntLittle(u64, memory[0..8]); + const header_json = memory[8..8+header_size]; + + // Parse the JSON header + var parsed = try std.json.parseFromSlice( + std.json.Value, + self.allocator, + header_json, + .{}, + ); + defer parsed.deinit(); + + // Get the model configuration from arguments or try to infer it + const args = try self.determineModelArgs(model_args, parsed.value); + + // Create the model with the determined configuration + var model = try TransformerModel.create(self.allocator, args); + errdefer model.destroy(); + + // Create a tensor mapping for zero-copy loading + try self.loadTensorsFromSafetensorsMemory( + model, + memory, + header_size, + parsed.value, + ); + + // Validate the loaded model if configured + if 
(self.config.validation != .none) { + try self.validateModel(model, parsed.value); + } + + return model; + } + + // Load a model from Hugging Face + fn loadFromHuggingFace( + self: *Self, + model_id: []const u8, + model_args: ?ModelArgs, + ) !*TransformerModel { + // Get cache directory or create a temporary one + const cache_dir = self.config.cache_dir orelse + try std.fs.getAppDataDir(self.allocator, "deepseek-zig"); + + // Create HF client + var hf_client = try HuggingFaceClient.init(self.allocator, cache_dir); + defer hf_client.deinit(); + + // Download the model + const model_path = try hf_client.downloadModel(model_id); + + // Load the downloaded model + return try self.loadModel(model_path, model_args); + } + + // Infer model arguments if not explicitly provided + fn determineModelArgs( + self: *Self, + model_args: ?ModelArgs, + header: std.json.Value, + ) !ModelArgs { + if (model_args) |args| { + return args; + } + + // Try to infer model configuration from the weight shapes + if (header.Object.get("metadata")) |metadata| { + if (metadata.Object.get("model_type")) |model_type| { + if (std.mem.eql(u8, model_type.String, "deepseek")) { + // Extract dimensions from metadata + return try self.parseDeepSeekConfig(metadata); + } + } + } + + // Infer from weight shapes if metadata is not available + return try self.inferArgsFromWeights(header); + } + + // ... more implementation details ... +}; + +// Implementation of TransformerModel +pub const TransformerModel = struct { + const Self = @This(); + + allocator: std.mem.Allocator, + args: ModelArgs, + + // Tokenizer for text processing + tokenizer: *Tokenizer, + + // Model components + embedding: *Embedding, + layers: []TransformerLayer, + norm: *LayerNorm, + lm_head: *Linear, + + // KV cache for efficient inference + kv_cache: ?*KVCache, + + // Backend for computation + backend: *ComputeBackend, + + // Create a model with the given configuration + pub fn create( + allocator: std.mem.Allocator, + args: ModelArgs, + ) !*Self { + // Create model components + var embedding = try Embedding.create(allocator, args); + errdefer embedding.destroy(); + + var layers = try allocator.alloc(TransformerLayer, args.num_layers); + errdefer allocator.free(layers); + + for (layers, 0..) 
|*layer, i| { + layer.* = try TransformerLayer.create(allocator, args, i); + } + + var norm = try LayerNorm.create(allocator, args.dim); + errdefer norm.destroy(); + + var lm_head = try Linear.create(allocator, args.dim, args.vocab_size); + errdefer lm_head.destroy(); + + // Initialize compute backend + var backend = try ComputeBackend.create(allocator); + errdefer backend.destroy(); + + // Initialize tokenizer + var tokenizer = try Tokenizer.create(allocator, args.vocab_size); + errdefer tokenizer.destroy(); + + // Create the model + var model = try allocator.create(Self); + errdefer allocator.destroy(model); + + model.* = .{ + .allocator = allocator, + .args = args, + .tokenizer = tokenizer, + .embedding = embedding, + .layers = layers, + .norm = norm, + .lm_head = lm_head, + .kv_cache = null, + .backend = backend, + }; + + return model; + } + + // Clean up resources + pub fn destroy(self: *Self) void { + // Free all components + self.tokenizer.destroy(); + self.embedding.destroy(); + + for (self.layers) |*layer| { + layer.deinit(); + } + self.allocator.free(self.layers); + + self.norm.destroy(); + self.lm_head.destroy(); + + if (self.kv_cache) |kv_cache| { + kv_cache.destroy(); + } + + self.backend.destroy(); + self.allocator.destroy(self); + } + + // Load a model from a specific path + pub fn loadFromPath( + allocator: std.mem.Allocator, + path: []const u8, + args: ?ModelArgs, + ) !*Self { + var loader = ModelLoader.init(allocator, .{}); + return try loader.loadModel(path, args); + } + + // Forward pass for a single token + pub fn forward( + self: *Self, + token_id: usize, + position: usize, + ) !Tensor(f32, 2) { + // Get the token embedding + var x = try self.embedding.forward(token_id); + + // Process through all transformer layers + for (self.layers, 0..) 
|*layer, i| { + x = try layer.forward(x, position, self.kv_cache); + } + + // Apply final layer norm + x = try self.norm.forward(x); + + // Project to vocabulary + return try self.lm_head.forward(x); + } + + // Prepare the model for generation + pub fn prepareForGeneration( + self: *Self, + max_seq_len: usize, + batch_size: usize, + ) !void { + // Create KV cache if not already created + if (self.kv_cache == null) { + self.kv_cache = try KVCache.create( + self.allocator, + self.args, + max_seq_len, + batch_size, + ); + } else { + // Reset the cache if it already exists + try self.kv_cache.?.reset(max_seq_len, batch_size); + } + } + + // Load tokenizer from vocabulary file + pub fn loadTokenizer( + self: *Self, + path: []const u8, + ) !void { + try self.tokenizer.loadFromFile(path); + } +}; +``` + +#### 4.2 Generation Strategies + +```zig +// Configuration for text generation +pub const GenerationConfig = struct { + // Maximum new tokens to generate + max_new_tokens: usize = 128, + + // Sampling temperature (higher = more random) + temperature: f32 = 1.0, + + // Top-p sampling parameter (0.0-1.0) + top_p: f32 = 1.0, + + // Top-k sampling parameter (0 = disabled) + top_k: usize = 0, + + // Repetition penalty to prevent looping + repetition_penalty: f32 = 1.0, + + // Whether to use sampling or greedy decoding + do_sample: bool = true, + + // Frequency penalty for repeated tokens + frequency_penalty: f32 = 0.0, + + // Presence penalty for token occurrence + presence_penalty: f32 = 0.0, + + // Stop sequences to terminate generation + stop_sequences: ?[]const []const u8 = null, + + // Minimum number of tokens to generate + min_new_tokens: ?usize = null, + + // Beam search width (1 = greedy) + num_beams: usize = 1, + + // Random seed for reproducibility + seed: ?u64 = null, + + // Whether to use speculative decoding + use_speculative: bool = false, + + // Draft model for speculative decoding + draft_model: ?*TransformerModel = null, + + // Number of speculative tokens to generate at once + speculative_tokens: usize = 5, +}; + +// Generate text from a model given input tokens +pub fn generate( + model: *TransformerModel, + input_ids: []const usize, + config: GenerationConfig, + callback: ?fn ([]const u8) void, +) ![]usize { + // Initialize RNG with seed if provided + var rng = if (config.seed) |seed| + std.rand.DefaultPrng.init(seed) + else + std.rand.DefaultPrng.init(@bitCast(u64, std.time.milliTimestamp())); + + // Allocate result buffer + var result = try model.allocator.alloc( + usize, + input_ids.len + config.max_new_tokens, + ); + errdefer model.allocator.free(result); + + // Copy input tokens + @memcpy(result[0..input_ids.len], input_ids); + var token_count = input_ids.len; + + // Prepare model for generation + try model.prepareForGeneration( + input_ids.len + config.max_new_tokens, + 1, // Batch size + ); + + // Process all input tokens to fill KV cache + var position: usize = 0; + for (input_ids) |token_id| { + _ = try model.forward(token_id, position); + position += 1; + } + + // Check if we should use speculative decoding + if (config.use_speculative and config.draft_model != null) { + return try speculativeGenerate( + model, + config.draft_model.?, + result, + token_count, + position, + config, + callback, + ); + } + + // Set up logit processors based on config + var logit_processors = LogitProcessorList.init(model.allocator); + defer logit_processors.deinit(); + + if (config.temperature != 1.0) { + try logit_processors.add(TemperatureLogitProcessor.init(config.temperature)); + } + + 
if (config.repetition_penalty != 1.0) { + try logit_processors.add(RepetitionPenaltyLogitProcessor.init( + config.repetition_penalty, + result[0..token_count], + )); + } + + if (config.frequency_penalty != 0.0 or config.presence_penalty != 0.0) { + try logit_processors.add(FrequencyPenaltyLogitProcessor.init( + config.frequency_penalty, + config.presence_penalty, + )); + } + + // Main generation loop + while (token_count < result.len) { + // Get next token logits + var logits = try model.forward(result[token_count - 1], position); + defer logits.deinit(); + + // Apply logit processors + try logit_processors.process(&logits, result[0..token_count]); + + // Sample next token + const next_token = if (config.do_sample) + try sampleNextToken( + model.allocator, + logits, + config.top_p, + config.top_k, + &rng.random(), + ) + else + try greedyNextToken(logits); + + // Add token to result + result[token_count] = next_token; + token_count += 1; + position += 1; + + // Check for stop sequences + if (config.stop_sequences) |stop_seqs| { + if (checkStopSequences( + model.tokenizer, + result[0..token_count], + stop_seqs, + )) { + break; + } + } + + // Call callback with generated token if provided + if (callback != null) { + var token_text = try model.tokenizer.decodeTokens( + model.allocator, + result[token_count-1..token_count], + ); + defer model.allocator.free(token_text); + + callback.?(token_text); + } + + // Check if we've reached minimum token count + if (config.min_new_tokens) |min_tokens| { + if (token_count >= input_ids.len + min_tokens) { + // Check if we're at an EOS token + if (next_token == model.tokenizer.eos_token_id) { + break; + } + } + } else if (next_token == model.tokenizer.eos_token_id) { + // Otherwise just stop at EOS + break; + } + } + + // Resize result to actual number of tokens + result = try model.allocator.realloc(result, token_count); + return result; +} + +// Speculative decoding implementation +fn speculativeGenerate( + model: *TransformerModel, + draft_model: *TransformerModel, + result: []usize, + token_count: usize, + position: usize, + config: GenerationConfig, + callback: ?fn ([]const u8) void, +) ![]usize { + // Implementation of speculative decoding algorithm + // This generates multiple tokens using a smaller draft model + // and verifies them with the main model for faster generation + + // ... implementation details ... 
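+    //
+    // A typical approach (sketch only, not the final implementation):
+    //   1. Ask `draft_model` to propose up to `config.speculative_tokens`
+    //      tokens from the current position, recording its probabilities.
+    //   2. Run the main `model` forward over the proposed tokens in a single
+    //      pass, reusing its KV cache, to obtain its own distributions.
+    //   3. Accept the longest prefix of proposals the main model agrees with
+    //      (e.g. the standard accept/reject test on the ratio of main-model
+    //      to draft-model probabilities).
+    //   4. Resample the first rejected position from the main model's
+    //      adjusted distribution, discard the remaining draft tokens, and
+    //      roll the KV caches back (e.g. via `KVCache.rotatePositions`)
+    //      before proposing the next batch.
+    // Because several tokens can be accepted per main-model pass, the output
+    // distribution matches ordinary sampling while reducing latency.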
+ return result; +} + +// Sample next token using top-p (nucleus) and top-k sampling +fn sampleNextToken( + allocator: std.mem.Allocator, + logits: Tensor(f32, 2), + top_p: f32, + top_k: usize, + random: *std.rand.Random, +) !usize { + const vocab_size = logits.shape[1]; + + // Create a sorted list of (token_id, probability) pairs + var token_probs = try allocator.alloc( + struct { token_id: usize, prob: f32 }, + vocab_size, + ); + defer allocator.free(token_probs); + + // Apply softmax to get probabilities + var probs = try softmax(allocator, logits); + defer probs.deinit(); + + // Fill token_probs array + for (0..vocab_size) |i| { + token_probs[i] = .{ + .token_id = i, + .prob = probs.data[i], + }; + } + + // Sort by probability (descending) + std.sort.sort( + struct { token_id: usize, prob: f32 }, + token_probs, + {}, + struct { + fn lessThan(_: void, a: struct { token_id: usize, prob: f32 }, b: struct { token_id: usize, prob: f32 }) bool { + return b.prob < a.prob; + } + }.lessThan, + ); + + // Apply top-k filtering if enabled + const k = if (top_k > 0) + @min(top_k, vocab_size) + else + vocab_size; + + // Apply top-p filtering + var cumulative_prob: f32 = 0.0; + var last_idx: usize = 0; + + for (token_probs[0..k], 0..) |tp, i| { + cumulative_prob += tp.prob; + if (cumulative_prob >= top_p) { + last_idx = i; + break; + } + } + + // Sample from the filtered distribution + const rand_val = random.float(f32); + var curr_prob: f32 = 0.0; + + for (token_probs[0..last_idx+1]) |tp| { + curr_prob += tp.prob; + if (rand_val < curr_prob) { + return tp.token_id; + } + } + + // Fallback to the highest probability token + return token_probs[0].token_id; +} +``` + +**Advanced Features:** + +1. **Speculative Decoding** + - Implementation of speculative decoding using a smaller draft model + - Verification and acceptance/rejection of speculated tokens + - Significant speedup in generation throughput + +2. **Streaming Token Output** + - Callback-based token streaming for real-time results + - Zero-copy token decoding for minimal overhead + - Support for incremental UI updates + +3. **Custom Sampling Strategies** + - Top-p (nucleus) sampling with dynamic probability mass cutoff + - Top-k sampling with configurable k value + - Temperature scaling for controlling randomness + - Repetition penalty to prevent loops and repetitive text + - Frequency and presence penalties for more diverse output + +4. **Stop Sequence Detection** + - Efficient detection of multiple stop sequences + - Support for subword token matching across boundaries + - Early termination based on generated content + +5. **Beam Search Implementation** + - Configurable beam width for exploring multiple generation paths + - Length normalization for balancing short and long outputs + - Diverse beam groups to prevent similar outputs + +6. **Memory Efficiency** + - KV-cache memory management for long context handling + - Incremental cache updates for streaming inference + - Automatic cache pruning for memory optimization + +7. **Performance Optimizations** + - Batched token processing for higher throughput + - Parallel sampling for multi-sequence generation + - SIMD-accelerated logit processing + - Compile-time specialization for common configuration patterns + +### 5. Optimization Layer + +The optimization layer leverages Zig's unique features to maximise performance across different hardware targets. 
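+
+As a small, hedged illustration of that idea, the sketch below uses Zig's compile-time target introspection to choose a SIMD width for the build target; the result can feed the kernel generator described in 5.1. The helper names follow current `std.Target` conventions and may differ slightly between Zig versions.
+
+```zig
+const std = @import("std");
+const builtin = @import("builtin");
+
+// Vector width (in f32 lanes) chosen at compile time for the build target.
+pub const simd_width: comptime_int = switch (builtin.cpu.arch) {
+    .x86_64 => if (std.Target.x86.featureSetHas(builtin.cpu.features, .avx512f))
+        16 // 512-bit registers
+    else if (std.Target.x86.featureSetHas(builtin.cpu.features, .avx2))
+        8 // 256-bit registers
+    else
+        4, // SSE baseline
+    .aarch64 => 4, // 128-bit NEON
+    else => 1, // scalar fallback
+};
+
+// Example: parameterize a specialized kernel with the detected width.
+// const Kernel = generateMatmulKernel(.{ .M = 128, .N = 128, .K = 128, .vector_width = simd_width });
+```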
+ +#### 5.1 Compile-Time Optimizations + +Zig's powerful compile-time metaprogramming enables us to generate highly specialized code for specific hardware and model configurations: + +```zig +// Specialized matrix multiplication kernels generated at compile-time +pub fn generateMatmulKernel(comptime config: KernelConfig) type { + return struct { + const Self = @This(); + + // Compile-time configuration + const M = config.M; + const N = config.N; + const K = config.K; + const block_size = config.block_size; + const vector_width = config.vector_width; + const use_fma = config.use_fma; + + // Vector type based on configuration + const Vec = @Vector(vector_width, f32); + + // Matmul implementation specialized for the given dimensions + pub fn matmul( + a: *const [M][K]f32, + b: *const [K][N]f32, + c: *[M][N]f32, + ) void { + // Use specialized implementation for small matrices + if (comptime M <= 4 and N <= 4 and K <= 4) { + return smallMatmul(a, b, c); + } + + // Use blocked implementation for larger matrices + return blockedMatmul(a, b, c); + } + + // Specialized implementation for small matrices + // Fully unrolled at compile time + fn smallMatmul( + a: *const [M][K]f32, + b: *const [K][N]f32, + c: *[M][N]f32, + ) void { + inline for (0..M) |i| { + inline for (0..N) |j| { + var sum: f32 = 0; + inline for (0..K) |k| { + sum += a[i][k] * b[k][j]; + } + c[i][j] = sum; + } + } + } + + // Cache-blocked implementation for larger matrices + fn blockedMatmul( + a: *const [M][K]f32, + b: *const [K][N]f32, + c: *[M][N]f32, + ) void { + // Compute using blocks for better cache utilization + comptime var i_block: usize = 0; + inline while (i_block < M) : (i_block += block_size) { + comptime var j_block: usize = 0; + inline while (j_block < N) : (j_block += block_size) { + comptime var k_block: usize = 0; + inline while (k_block < K) : (k_block += block_size) { + const i_end = @min(i_block + block_size, M); + const j_end = @min(j_block + block_size, N); + const k_end = @min(k_block + block_size, K); + + // Process current block + for (i_block..i_end) |i| { + for (j_block..j_end) |j| { + var sum: f32 = c[i][j]; + + // Vectorized inner loop when possible + if (comptime vector_width > 1 and (k_end - k_block) >= vector_width) { + var k_vec: usize = k_block; + var acc: Vec = @splat(0.0); + + while (k_vec + vector_width <= k_end) : (k_vec += vector_width) { + const a_vec: Vec = blk: { + var tmp: [vector_width]f32 = undefined; + for (0..vector_width) |vi| { + tmp[vi] = a[i][k_vec + vi]; + } + break :blk tmp; + }; + + const b_vec: Vec = blk: { + var tmp: [vector_width]f32 = undefined; + for (0..vector_width) |vi| { + tmp[vi] = b[k_vec + vi][j]; + } + break :blk tmp; + }; + + // Use FMA instruction if available + if (comptime use_fma) { + acc = @mulAdd(Vec, a_vec, b_vec, acc); + } else { + acc += a_vec * b_vec; + } + } + + // Reduce vector to scalar + for (0..vector_width) |vi| { + sum += acc[vi]; + } + + // Handle remaining elements + for (k_vec..k_end) |k| { + sum += a[i][k] * b[k][j]; + } + } else { + // Scalar fallback + for (k_block..k_end) |k| { + sum += a[i][k] * b[k][j]; + } + } + + c[i][j] = sum; + } + } + } + } + } + } + }; +} + +// Configuration for kernel generation +pub const KernelConfig = struct { + // Matrix dimensions (can be comptime_int or dynamic) + M: comptime_int, + N: comptime_int, + K: comptime_int, + + // Blocking configuration for cache optimization + block_size: comptime_int = 32, + + // Vector width for SIMD operations + vector_width: comptime_int = 4, + + // Whether to use FMA 
instructions when available + use_fma: bool = true, +}; + +// Usage: Create specialized kernels at compile time +// Fully unrolled 4x4 matrix multiplication +const Kernel4x4 = generateMatmulKernel(.{ + .M = 4, + .N = 4, + .K = 4, + .vector_width = 4, +}); + +// Cache-friendly 128x128 matrix multiplication +const Kernel128x128 = generateMatmulKernel(.{ + .M = 128, + .N = 128, + .K = 128, + .block_size = 32, + .vector_width = 8, +}); + +// Runtime dispatch to select the best kernel based on matrix dimensions +pub fn dispatchMatmul( + allocator: std.mem.Allocator, + a: Tensor(f32, 2), + b: Tensor(f32, 2), +) !Tensor(f32, 2) { + // Check dimensions + const m = a.shape[0]; + const k = a.shape[1]; + const n = b.shape[1]; + + std.debug.assert(k == b.shape[0], "Incompatible matrix dimensions"); + + // Create result tensor + var result = try Tensor(f32, 2).init(allocator, .{m, n}); + errdefer result.deinit(); + + // Initialize result to zeros + @memset(result.data, 0); + + // Dispatch to specialized kernels if dimensions match exactly + if (m == 4 and n == 4 and k == 4) { + // Use specialized 4x4 kernel + Kernel4x4.matmul( + @ptrCast(*const [4][4]f32, a.data), + @ptrCast(*const [4][4]f32, b.data), + @ptrCast(*[4][4]f32, result.data), + ); + } else if (m == 128 and n == 128 and k == 128) { + // Use specialized 128x128 kernel + Kernel128x128.matmul( + @ptrCast(*const [128][128]f32, a.data), + @ptrCast(*const [128][128]f32, b.data), + @ptrCast(*[128][128]f32, result.data), + ); + } else { + // Use generic implementation for arbitrary dimensions + try genericMatmul(a, b, &result); + } + + return result; +} + +// Apply compile-time metaprogramming to optimize data layouts +pub fn optimizedTensorLayout(comptime T: type, comptime dims: []const usize) type { + return struct { + const Self = @This(); + + // Determine optimal memory layout at compile time + const optimal_layout = optimizeMemoryLayout(T, dims); + + // Data storage with optimized layout + data: [product(dims)]T align(optimal_layout.alignment), + shape: [dims.len]usize, + strides: [dims.len]usize, + + // Tensor initialization with optimal layout + pub fn init(allocator: std.mem.Allocator) !Self { + const data = try allocator.alignedAlloc( + T, + optimal_layout.alignment, + product(dims), + ); + + // Calculate optimal strides based on layout + var strides: [dims.len]usize = undefined; + if (optimal_layout.row_major) { + // Row-major strides + var stride: usize = 1; + var i: usize = dims.len; + while (i > 0) { + i -= 1; + strides[i] = stride; + stride *= dims[i]; + } + } else { + // Column-major strides + var stride: usize = 1; + for (0..dims.len) |i| { + strides[i] = stride; + stride *= dims[i]; + } + } + + return Self{ + .data = data, + .shape = dims, + .strides = strides, + }; + } + + // Helper function to calculate optimal memory layout + fn optimizeMemoryLayout(comptime T: type, comptime dims: []const usize) struct { + row_major: bool, + alignment: u29, + } { + // Use column-major for matrices where the first dimension is much larger + // This often improves cache locality for common access patterns + const row_major = if (dims.len == 2) + dims[0] <= dims[1] * 2 + else + true; + + // Determine optimal alignment based on vector units + const alignment = if (@sizeOf(T) == 4 and comptime std.Target.current.cpu.arch == .x86_64) + if (comptime std.Target.current.cpu.features.isEnabled(.avx512f)) + 64 // 512-bit alignment for AVX-512 + else if (comptime std.Target.current.cpu.features.isEnabled(.avx2)) + 32 // 256-bit alignment for AVX2 + else if 
(comptime std.Target.current.cpu.features.isEnabled(.sse2)) + 16 // 128-bit alignment for SSE2 + else + @alignOf(T) + else + @alignOf(T); + + return .{ + .row_major = row_major, + .alignment = alignment, + }; + } + + // Helper to calculate the product of dimensions + fn product(comptime dims: []const usize) usize { + var result: usize = 1; + for (dims) |dim| { + result *= dim; + } + return result; + } + }; +} +``` + +**Key Compile-Time Techniques:** + +1. **Matrix Operation Specialization** + - Specialized kernels generated at compile-time for common dimensions + - Full loop unrolling for small matrices + - Compile-time configurable blocking strategies for cache optimization + +2. **Data Layout Optimization** + - Automatic selection of row-major or column-major layout based on dimensions + - Optimal memory alignment for target architecture's vector units + - Compile-time stride calculation for fast indexing + +3. **Architecture-Specific Optimizations** + - Vector width specialization based on target CPU features + - Automatic use of FMA instructions when available + - SIMD instruction generation tailored to the target architecture + +4. **Kernel Selection** + - Runtime dispatch to specialized kernels based on input dimensions + - Fallback to generic implementation for arbitrary dimensions + - Compile-time branch elimination for performance-critical paths + +#### 5.2 Quantization Framework + +Our quantization framework allows for efficient low-precision inference while maintaining accuracy: + +```zig +// Quantization configuration +pub const QuantizationConfig = struct { + // Precision of quantized values + bits: u8 = 8, + + // Quantization scheme + scheme: enum { + symmetric, // Zero-point is always 0, simplifies arithmetic + asymmetric, // Allows representing the full range more precisely + } = .symmetric, + + // Quantization granularity + granularity: enum { + per_tensor, // One scale for the entire tensor + per_channel, // Different scale for each output channel + } = .per_tensor, + + // Whether to use integer or float16 quantization + use_float16: bool = false, + + // Calibration strategy + calibration: enum { + minmax, // Simple min/max scaling + entropy, // Entropy-based quantization + percentile, // Clip to percentile range for outliers + } = .minmax, + + // Percentile value for calibration (0.0-1.0) + percentile: f32 = 0.99995, +}; + +// Quantized tensor type that tracks quantization parameters +pub fn QuantizedTensor(comptime original_type: type, comptime bits: u8) type { + return struct { + const Self = @This(); + + // Determine the appropriate integer type based on bit width + const IntType = std.meta.Int(.unsigned, bits); + + // Original element type for reference + pub const OriginalType = original_type; + + // Quantized data + data: []IntType, + + // Original tensor shape + shape: []const usize, + + // Quantization parameters + scale: []f32, + zero_point: []IntType, + + // Whether scale/zero_point are per-tensor or per-channel + per_channel: bool, + + // For asymmetric quantization: minimum representable value + qmin: IntType, + + // For asymmetric quantization: maximum representable value + qmax: IntType, + + // Channel dimension for per-channel quantization + channel_dim: ?usize, + + // Memory allocator for cleanup + allocator: std.mem.Allocator, + + // Initialize a quantized tensor + pub fn init( + allocator: std.mem.Allocator, + shape: []const usize, + per_channel: bool, + channel_dim: ?usize, + ) !Self { + // Calculate total size + var total_size: usize = 1; + for 
(shape) |dim| { + total_size *= dim; + } + + // Determine number of scales/zero_points needed + const param_size = if (per_channel) + shape[channel_dim.?] + else + 1; + + // Allocate memory + const data = try allocator.alloc(IntType, total_size); + errdefer allocator.free(data); + + const scale = try allocator.alloc(f32, param_size); + errdefer allocator.free(scale); + + const zero_point = try allocator.alloc(IntType, param_size); + errdefer allocator.free(zero_point); + + // Calculate quantization range + const qmin: IntType = 0; + const qmax: IntType = (1 << bits) - 1; + + // Create shape copy + const shape_copy = try allocator.dupe(usize, shape); + errdefer allocator.free(shape_copy); + + return Self{ + .data = data, + .shape = shape_copy, + .scale = scale, + .zero_point = zero_point, + .per_channel = per_channel, + .qmin = qmin, + .qmax = qmax, + .channel_dim = channel_dim, + .allocator = allocator, + }; + } + + // Free allocated memory + pub fn deinit(self: *Self) void { + self.allocator.free(self.data); + self.allocator.free(self.scale); + self.allocator.free(self.zero_point); + self.allocator.free(self.shape); + } + }; +} + +// Quantize a floating-point tensor to integer precision +pub fn quantize( + tensor: anytype, + config: QuantizationConfig, + allocator: std.mem.Allocator, +) !QuantizedTensor( + @TypeOf(tensor.data[0]), + config.bits, +) { + const T = @TypeOf(tensor.data[0]); + + // Validate input + if (config.bits > 16) { + return error.UnsupportedQuantizationBits; + } + + if (config.granularity == .per_channel and config.calibration != .minmax) { + return error.UnsupportedCombination; + } + + // Create quantized tensor + var channel_dim: ?usize = null; + if (config.granularity == .per_channel) { + // For per-channel quantization, use dimension 0 for vectors, + // dimension 1 for matrices (assuming CHW layout) + channel_dim = if (tensor.shape.len == 1) 0 else 1; + } + + var qtensor = try QuantizedTensor(T, config.bits).init( + allocator, + tensor.shape, + config.granularity == .per_channel, + channel_dim, + ); + errdefer qtensor.deinit(); + + // Different calibration strategies + switch (config.calibration) { + .minmax => try calibrateMinMax(&qtensor, tensor, config), + .entropy => try calibrateEntropy(&qtensor, tensor, config), + .percentile => try calibratePercentile(&qtensor, tensor, config), + } + + // Perform actual quantization + try quantizeTensor(&qtensor, tensor, config); + + return qtensor; +} + +// Dequantize a tensor back to floating point +pub fn dequantize( + qtensor: anytype, + allocator: std.mem.Allocator, +) !Tensor(@TypeOf(qtensor).OriginalType, qtensor.shape.len) { + const T = @TypeOf(qtensor).OriginalType; + + // Create tensor to hold dequantized values + var tensor = try Tensor(T, qtensor.shape.len).init( + allocator, + qtensor.shape, + ); + errdefer tensor.deinit(); + + // Dequantize values + if (qtensor.per_channel) { + const channel_dim = qtensor.channel_dim.?; + const channels = qtensor.shape[channel_dim]; + + // Calculate strides for traversing channels + var strides: []usize = try allocator.alloc(usize, qtensor.shape.len); + defer allocator.free(strides); + + var stride: usize = 1; + var i: usize = qtensor.shape.len; + while (i > 0) { + i -= 1; + strides[i] = stride; + stride *= qtensor.shape[i]; + } + + // Dequantize each element based on its channel + for (0..tensor.data.len) |idx| { + const channel_idx = (idx / strides[channel_dim]) % channels; + const scale = qtensor.scale[channel_idx]; + const zero_point = qtensor.zero_point[channel_idx]; + + 
tensor.data[idx] = @floatCast(T, + @intToFloat(f32, qtensor.data[idx] - zero_point) * scale + ); + } + } else { + // Per-tensor dequantization (simpler) + const scale = qtensor.scale[0]; + const zero_point = qtensor.zero_point[0]; + + for (0..tensor.data.len) |i| { + tensor.data[i] = @floatCast(T, + @intToFloat(f32, qtensor.data[i] - zero_point) * scale + ); + } + } + + return tensor; +} + +// Calibrate using simple min/max strategy +fn calibrateMinMax( + qtensor: anytype, + tensor: anytype, + config: QuantizationConfig, +) !void { + if (config.granularity == .per_tensor) { + // Find min/max across entire tensor + var min_val: f32 = std.math.inf_f32; + var max_val: f32 = -std.math.inf_f32; + + for (tensor.data) |val| { + const fval = @floatCast(f32, val); + min_val = @min(min_val, fval); + max_val = @max(max_val, fval); + } + + // Handle symmetric quantization + if (config.scheme == .symmetric) { + const abs_max = @max(@abs(min_val), @abs(max_val)); + min_val = -abs_max; + max_val = abs_max; + } + + // Calculate scale and zero_point + const range = max_val - min_val; + qtensor.scale[0] = range / @intToFloat(f32, qtensor.qmax - qtensor.qmin); + + if (config.scheme == .symmetric) { + qtensor.zero_point[0] = @divFloor(qtensor.qmax - qtensor.qmin, 2) + qtensor.qmin; + } else { + qtensor.zero_point[0] = @floatToInt( + @TypeOf(qtensor.zero_point[0]), + @round(qtensor.qmin - min_val / qtensor.scale[0]) + ); + } + } else { + // Per-channel quantization + // ... implementation details ... + } +} + +// Perform actual quantization +fn quantizeTensor( + qtensor: anytype, + tensor: anytype, + config: QuantizationConfig, +) !void { + if (qtensor.per_channel) { + // Per-channel quantization + // ... implementation details ... + } else { + // Per-tensor quantization + const scale = qtensor.scale[0]; + const zero_point = qtensor.zero_point[0]; + const qmin = qtensor.qmin; + const qmax = qtensor.qmax; + + for (0..tensor.data.len) |i| { + const val = @floatCast(f32, tensor.data[i]); + + // Quantize: x_q = round(x / scale) + zero_point + var q_val = @floatToInt( + @TypeOf(qtensor.data[0]), + @round(val / scale) + @intToFloat(f32, zero_point) + ); + + // Clamp to quantization range + q_val = @max(@min(q_val, qmax), qmin); + + qtensor.data[i] = q_val; + } + } +} +``` + +**Quantization Features:** + +1. **Multiple Precision Options** + - 8-bit quantization for maximum throughput + - 4-bit quantization for model compression + - 3-bit quantization for extreme size reduction + - FP16 quantization for memory bandwidth reduction with minimal accuracy loss + +2. **Flexible Quantization Schemes** + - Symmetric quantization for simpler arithmetic + - Asymmetric quantization for better range utilization + - Per-tensor quantization for speed + - Per-channel quantization for accuracy + +3. **Advanced Calibration Methods** + - Min/max calibration for simplicity + - Entropy-based calibration for better distribution representation + - Percentile-based calibration for outlier handling + +4. **Mixed-Precision Execution** + - Critical layers in higher precision for accuracy + - Non-critical layers in lower precision for speed + - Automatic precision selection based on sensitivity analysis + +5. 
**Hardware Acceleration** + - Optimized integer SIMD operations for quantized execution + - Specialized kernels for common quantized operations + - Hardware-specific optimizations for quantized compute + +## Platform-Specific Optimizations + +### Apple Silicon (M-Series) + +The DeepSeek V3 Zig implementation is highly optimized for Apple Silicon's unique architecture: + +1. **Metal Performance Shaders (MPS) Integration** + - Direct integration with Apple's Metal Performance Shaders for matrix operations + - Custom Metal compute kernels optimized for M-series chips + - Efficient memory sharing between CPU and GPU with zero-copy transfers + +2. **Tensor Core Utilization** + - Leveraging Matrix multiplication units in M-series chips + - Mixed-precision operations optimized for Apple Silicon + - Native FP16 support for improved throughput + +3. **AMX Instruction Set Access** + - Direct use of Apple Matrix extensions for accelerated linear algebra + - Low-level optimization of critical matrix operations + - Custom assembly routines for maximum performance + +4. **Memory Bandwidth Optimization** + - Unified memory architecture exploitation + - Cache-friendly memory access patterns + - Optimal tile sizes for M-series cache hierarchy + +5. **Power Efficiency Tuning** + - Dynamic performance/power scaling + - Efficient core utilization across P and E cores + - Background inference optimizations + +### x86_64 Architecture + +For x86_64 platforms, our implementation focuses on leveraging the latest instruction sets: + +1. **AVX-512 Vectorization** + - Full utilization of 512-bit vector operations + - Masked operations for efficient boundary handling + - FMA instruction usage for maximum throughput + +2. **Cache-Friendly Memory Layouts** + - Cache line aligned data structures + - Blocked algorithms optimized for typical L1/L2/L3 cache sizes + - Software prefetching for critical data paths + +3. **Thread Pool Optimization** + - Work-stealing scheduler for balanced multicore utilization + - NUMA-aware memory allocation and thread assignment + - Adaptive parallelism based on available cores + +4. **Dynamic Dispatch** + - Runtime CPU feature detection + - Specialized code paths for different instruction sets + - Fallback implementations for compatibility + +### NVIDIA GPUs + +NVIDIA GPU acceleration is implemented through an efficient CUDA integration: + +1. **CUDA Integration via FFI** + - Zero-overhead bindings to CUDA runtime + - Asynchronous kernel execution and memory transfers + - Efficient stream management for overlapping operations + +2. **Custom CUDA Kernels** + - Specialized kernels for attention mechanisms + - Optimized matrix multiplication for transformer layers + - Fused operations for reduced kernel launch overhead + +3. **Memory Management** + - Pinned memory for efficient transfers + - Memory pool for reduced allocation overhead + - Smart prefetching for predictable memory access patterns + +4. 
**Tensor Core Utilization** + - Mixed-precision operations using TensorCores + - Automatic kernel selection for tensor-core eligible operations + - Tensor Core compatible memory layouts + +## Development Roadmap + +### Phase 1: Core Infrastructure + +The initial phase focuses on establishing the foundational components: + +- **Memory Management System** + - Custom tensor allocator implementation + - Arena-based allocation strategies + - Error handling framework + +- **Tensor Implementation** + - Basic tensor operations and utilities + - SIMD-accelerated implementations + - Platform detection and optimization + +- **Computation Backend Interfaces** + - Abstract backend interfaces + - CPU backend implementation + - Initial Metal backend for Apple Silicon + +- **Error Handling Framework** + - Robust error propagation + - Detailed error reporting + - Resource cleanup guarantees + +### Phase 2: Model Architecture + +Building on the infrastructure, we implement the core model components: + +- **Transformer Layers** + - Multi-head attention implementation + - Feed-forward networks + - Layer normalization + +- **Attention Mechanisms** + - Standard attention implementation + - Flash attention optimizations + - Memory-efficient attention variants + +- **Mixture of Experts** + - Router implementation + - Parallel expert execution + - Load balancing mechanisms + +- **Embedding Systems** + - Token embeddings + - Position embeddings + - Rotary position embeddings + +### Phase 3: Backend Integration + +This phase extends compute capabilities across different hardware: + +- **CPU Backend** + - AVX-512 optimizations + - Thread pool implementation + - Cache-optimized algorithms + +- **Metal Backend** + - Complete Metal shader library + - Apple Neural Engine integration + - M-series specific optimizations + +- **CUDA Backend** + - NVIDIA GPU support + - Tensor Core optimizations + - Multi-GPU scaling + +- **Vulkan Backend** + - Cross-platform GPU support + - AMD GPU optimizations + - Intel GPU support + +### Phase 4: Inference Pipeline + +Creating the end-to-end inference system: + +- **Model Loading** + - SafeTensors format support + - Checkpoint loading + - Weight quantization + +- **Tokenization** + - Efficient tokenizer implementation + - Streaming tokenization + - Special token handling + +- **Generation Strategies** + - Sampling methods implementation + - Beam search + - Speculative decoding + +- **Output Processing** + - Token streaming + - Stop sequence handling + - Result formatting + +### Phase 5: Optimization + +Comprehensive optimization across the entire stack: + +- **Compile-Time Optimizations** + - Template specialization + - Kernel generation + - Custom data layouts + +- **Runtime Optimizations** + - Dynamic kernel selection + - Adaptive compute strategies + - Memory access optimizations + +- **Architecture-Specific Tuning** + - Platform-specific parameter tuning + - Hardware-specific kernel variants + - Feature detection and adaptation + +- **Quantization Framework** + - 8-bit quantization + - 4-bit quantization + - Mixed precision execution + +### Phase 6: Testing and Benchmarking + +Ensuring correctness and measuring performance: + +- **Comprehensive Test Suite** + - Unit tests for all components + - Integration tests for end-to-end validation + - Conformance tests against reference implementation + +- **Benchmarking Framework** + - Performance measurement tools + - Comparison with PyTorch implementation + - Memory usage analysis + +- **Platform Benchmarks** + - Apple Silicon performance + 
- x86_64 performance
+  - NVIDIA GPU performance
+
+- **Fine-Tuning**
+  - Performance bottleneck identification
+  - Targeted optimizations
+  - Final parameter tuning
+
+## Why DeepSeek V3 in Zig?
+
+The migration of DeepSeek V3 to Zig represents a significant advancement in language model implementation. By leveraging Zig's unique features, particularly compile-time metaprogramming and fine-grained memory control, we aim to create a highly optimized implementation that significantly outperforms the original Python/PyTorch version while maintaining flexibility and ease of use.
+
+Key advantages of the Zig implementation include:
+
+1. **Superior Performance**
+   - Compile-time specialization eliminates runtime overhead
+   - Direct hardware access for maximum efficiency
+   - Zero-cost abstractions for clean yet fast code
+
+2. **Memory Efficiency**
+   - Explicit allocation strategies tailored to LLM workloads
+   - Reduced memory fragmentation
+   - Lower overall memory footprint
+
+3. **Reliability**
+   - Comprehensive error handling
+   - No runtime exceptions
+   - Deterministic resource cleanup
+
+4. **Portability**
+   - Cross-platform support with optimized backends
+   - Consistent behavior across environments
+   - Single codebase for all target platforms
+
+The resulting system is designed to be particularly well-suited for deployment on resource-constrained devices while delivering strong performance across all supported platforms. This architectural approach sets the foundation for future innovations in large language model deployment.
\ No newline at end of file
diff --git a/dzv3-logo.svg b/dzv3-logo.svg
new file mode 100644
index 0000000..849ea60
--- /dev/null
+++ b/dzv3-logo.svg
@@ -0,0 +1,31 @@
+[dzv3-logo.svg — "DeepZig V3" logo with the tagline "High-performance LLM inference engine written in Zig"]