diff --git a/INTRO.rst b/INTRO.rst
new file mode 100644
index 0000000..1992ba2
--- /dev/null
+++ b/INTRO.rst
@@ -0,0 +1,165 @@

1. Introduction
===============

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI). Beyond closed-source models, open-source models, including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), the LLaMA series (AI@Meta, 2024a,b; Touvron et al., 2023a,b), the Qwen series (Qwen, 2023, 2024a,b), and the Mistral series (Jiang et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with their closed-source counterparts. To further push the boundaries of open-source model capabilities, we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE) model with 671B total parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai et al., 2024) for cost-effective training. These two architectures have been validated in DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance while achieving efficient training and inference. Beyond the basic architecture, we implement two additional strategies to further enhance the model's capabilities. First, DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of minimizing the adverse impact on model performance that arises from the effort to encourage load balancing. Second, DeepSeek-V3 employs a multi-token prediction training objective, which we have observed to enhance the overall performance on evaluation benchmarks.

In order to achieve efficient training, we support FP8 mixed precision training and implement comprehensive optimizations for the training framework. Low-precision training has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al., 2019; Narang et al., 2017; Peng et al., 2023b), its evolution being closely tied to advancements in hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this work, we introduce an FP8 mixed precision training framework and, for the first time, validate its effectiveness on an extremely large-scale model. Through support for FP8 computation and storage, we achieve both accelerated training and reduced GPU memory usage. As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap. This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving near-zero all-to-all communication overhead. In addition, we develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism. Combining these efforts, we achieve high training efficiency.
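
The FP8 recipe itself is covered in the infrastructure section. Purely as a rough illustration of the storage-side idea, the sketch below shows per-tensor dynamic scaling into the E4M3 format; it is a hypothetical example rather than the framework described in this report, and it assumes a recent PyTorch build that exposes ``torch.float8_e4m3fn``.

.. code-block:: python

   # Illustrative only: per-tensor dynamic scaling for FP8 (E4M3) storage.
   # This is not DeepSeek-V3's training framework; it merely demonstrates why
   # a scale factor is needed before casting into FP8's narrow range.
   import torch

   E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3

   def quantize_fp8(x: torch.Tensor):
       """Rescale a tensor into the E4M3 range and cast it to FP8 for storage."""
       amax = x.abs().max().clamp(min=1e-12)  # avoid division by zero
       scale = E4M3_MAX / amax
       return (x * scale).to(torch.float8_e4m3fn), scale

   def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
       """Recover an approximate higher-precision tensor before accumulation."""
       return x_fp8.to(torch.float32) / scale

   w = torch.randn(1024, 1024)
   w_fp8, s = quantize_fp8(w)
   print("max abs error:", (dequantize_fp8(w_fp8, s) - w).abs().max().item())

A production framework would typically apply scaling at a finer granularity and fuse it into the matmul kernels; the point of the snippet is only that FP8's narrow dynamic range makes explicit scale factors necessary.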

During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.
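
As a purely hypothetical illustration of this two-stage schedule, the snippet below expresses it as data; the 4K base window and the derived scaling factors are assumptions for this sketch, and the actual long-context recipe appears in the pre-training section.

.. code-block:: python

   # Hypothetical schedule for the two-stage context extension described above.
   # The 4K base context is an assumption for illustration only.
   BASE_CONTEXT = 4 * 1024

   extension_stages = [
       {"stage": 1, "max_context": 32 * 1024},
       {"stage": 2, "max_context": 128 * 1024},
   ]

   for cfg in extension_stages:
       factor = cfg["max_context"] / BASE_CONTEXT
       print(f"stage {cfg['stage']}: extend context to {cfg['max_context']} tokens "
             f"(~{factor:.0f}x the assumed base window)")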

.. raw:: html

   <table>
     <caption>Table 1 | Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.</caption>
     <thead>
       <tr>
         <th>Training Costs</th>
         <th>Pre-Training</th>
         <th>Context Extension</th>
         <th>Post-Training</th>
         <th>Total</th>
       </tr>
     </thead>
     <tbody>
       <tr>
         <td>in H800 GPU Hours</td>
         <td>2664K</td>
         <td>119K</td>
         <td>5K</td>
         <td>2788K</td>
       </tr>
       <tr>
         <td>in USD</td>
         <td>$5.328M</td>
         <td>$0.238M</td>
         <td>$0.01M</td>
         <td>$5.576M</td>
       </tr>
     </tbody>
   </table>
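
As a quick sanity check, the figures in Table 1 are internally consistent; the short script below reproduces them from the per-stage GPU-hour counts, the $2/GPU-hour rental assumption, and the cluster size and per-trillion-token rate quoted in the cost discussion later in this section.

.. code-block:: python

   # Back-of-the-envelope check of Table 1, using only numbers quoted in this
   # section: per-stage H800 GPU hours, a $2/GPU-hour rental price, a 2048-GPU
   # cluster, and ~180K GPU hours per trillion training tokens.
   GPU_HOURS = {"pre-training": 2_664_000, "context extension": 119_000, "post-training": 5_000}
   PRICE_PER_GPU_HOUR = 2.0       # USD, rental assumption from Table 1
   CLUSTER_GPUS = 2048            # H800 GPUs in the training cluster
   HOURS_PER_TRILLION = 180_000   # pre-training GPU hours per trillion tokens

   total_hours = sum(GPU_HOURS.values())
   print(f"total GPU hours: {total_hours / 1e6:.3f}M")                                    # 2.788M
   print(f"total cost: ${total_hours * PRICE_PER_GPU_HOUR / 1e6:.3f}M")                   # $5.576M
   print(f"days per trillion tokens: {HOURS_PER_TRILLION / CLUSTER_GPUS / 24:.1f}")       # ~3.7
   print(f"pre-training hours for 14.8T tokens: {14.8 * HOURS_PER_TRILLION / 1e6:.3f}M")  # 2.664M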

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the strongest open-source base model currently available, especially in code and math. Its chat version also outperforms other open-source models and achieves performance comparable to leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard and open-ended benchmarks.

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware. During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.

Our main contributions include:

**Architecture: Innovative Load Balancing Strategy and Training Objective**

* On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free strategy for load balancing, which minimizes the performance degradation that arises from encouraging load balancing (see the sketch following this contribution list).
* We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model performance. It can also be used for speculative decoding for inference acceleration.

**Pre-Training: Towards Ultimate Training Efficiency**

* We design an FP8 mixed precision training framework and, for the first time, validate the feasibility and effectiveness of FP8 training on an extremely large-scale model.
* Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation-communication overlap. This significantly enhances our training efficiency and reduces the training costs, enabling us to further scale up the model size without additional overhead.
* At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model. The subsequent training stages after pre-training require only 0.1M GPU hours.

**Post-Training: Knowledge Distillation from DeepSeek-R1**

* We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek-R1 series models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance. Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.
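
To make the auxiliary-loss-free contribution above more concrete, the sketch below illustrates the general idea of bias-based balancing: a per-expert bias nudges top-k expert selection toward under-loaded experts and is adjusted from observed load rather than trained through an auxiliary loss term. The shapes, the softmax gating, and the update rule are illustrative assumptions, not the implementation used in DeepSeek-V3.

.. code-block:: python

   import torch

   NUM_EXPERTS, TOP_K, BIAS_STEP = 8, 2, 1e-3
   expert_bias = torch.zeros(NUM_EXPERTS)  # adjusted online, never updated by a loss

   def route(affinity: torch.Tensor):
       """affinity: [num_tokens, NUM_EXPERTS] router scores for one micro-batch."""
       # The bias only influences which experts are selected ...
       _, chosen = torch.topk(affinity + expert_bias, TOP_K, dim=-1)
       # ... while the combine weights still come from the unbiased scores.
       weights = torch.gather(affinity.softmax(dim=-1), 1, chosen)
       return chosen, weights / weights.sum(dim=-1, keepdim=True)

   def update_bias(chosen: torch.Tensor) -> None:
       """Push over-loaded experts' bias down and under-loaded experts' bias up."""
       global expert_bias
       load = torch.bincount(chosen.flatten(), minlength=NUM_EXPERTS).float()
       expert_bias = expert_bias - BIAS_STEP * torch.sign(load - load.mean())

   scores = torch.randn(16, NUM_EXPERTS)  # stand-in router outputs for 16 tokens
   chosen, weights = route(scores)
   update_bias(chosen)
   print("per-expert load:", torch.bincount(chosen.flatten(), minlength=NUM_EXPERTS).tolist())

Because no balancing term is added to the training loss, balancing does not interfere with the learning objective, which is the motivation for the "auxiliary-loss-free" framing.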

**Summary of Core Evaluation Results**

* **Knowledge:** (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA, DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9 on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and closed-source models in this domain. (2) On factuality benchmarks, DeepSeek-V3 demonstrates superior performance among open-source models on both SimpleQA and Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese SimpleQA), highlighting its strength in this area.
* **Code, Math, and Reasoning:** (1) DeepSeek-V3 achieves state-of-the-art performance on math-related benchmarks among all non-long-CoT open-source and closed-source models. Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500, demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks, DeepSeek-V3 emerges as the top-performing model on coding competition benchmarks, such as LiveCodeBench, solidifying its position as the leading model in this domain. On engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5, it still outpaces all other models by a significant margin, demonstrating its competitiveness across diverse technical benchmarks.

In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3 model architecture (Section 2). Subsequently, we introduce our infrastructure, encompassing our compute clusters, the training framework, the support for FP8 training, the inference deployment strategy, and our suggestions on future hardware design (Section 3). Next, we describe our pre-training process, including the construction of training data, hyper-parameter settings, long-context extension techniques, the associated evaluations, as well as some discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and discussions (Section 5). Lastly, we conclude this work, discuss the existing limitations of DeepSeek-V3, and propose potential directions for future research (Section 6).