1. Introduction
===============

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and
evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards
Artificial General Intelligence (AGI). Beyond closed-source models, open-source models,
including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), LLaMA series (AI@Meta,
2024a,b; Touvron et al., 2023a,b), Qwen series (Qwen, 2023, 2024a,b), and Mistral series (Jiang
et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with
their closed-source counterparts. To further push the boundaries of open-source model capabilities,
we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE)
model with 671B parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance
and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head
Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai
et al., 2024) for cost-effective training. These two architectures have been validated in
DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model
performance while achieving efficient training and inference. Beyond the basic architecture, we
implement two additional strategies to further enhance the model capabilities. Firstly,
DeepSeek-V3 pioneers an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing,
with the aim of minimizing the adverse impact on model performance that arises from the effort
to encourage load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training
objective, which we have observed to enhance the overall performance on evaluation benchmarks.

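For intuition before the formal description later in the paper, the minimal NumPy sketch
below illustrates the kind of bias-based routing adjustment that an auxiliary-loss-free
strategy in the spirit of Wang et al. (2024a) performs: a per-expert bias shifts only which
experts are selected, and is nudged after each step so that overloaded experts become less
likely to be chosen. The expert count, top-k value, gate normalization, and update step are
illustrative assumptions, not DeepSeek-V3's actual configuration.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   num_experts, top_k, dim = 8, 2, 16   # illustrative sizes, far smaller than DeepSeek-V3
   gamma = 0.01                         # illustrative bias update step
   bias = np.zeros(num_experts)         # per-expert routing bias, adjusted between steps

   def route(tokens, w_gate):
       """Pick top-k experts per token; the bias shifts selection only and is
       excluded from the gate weights used to mix expert outputs."""
       scores = tokens @ w_gate                                   # (batch, num_experts)
       chosen = np.argsort(scores + bias, axis=-1)[:, -top_k:]    # biased top-k selection
       gates = np.take_along_axis(scores, chosen, axis=-1)
       gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)
       return chosen, gates

   def update_bias(chosen):
       """Auxiliary-loss-free balancing: push down the bias of overloaded experts
       and raise it for underloaded ones, instead of adding a balance loss."""
       global bias
       load = np.bincount(chosen.ravel(), minlength=num_experts)
       bias -= gamma * np.sign(load - load.mean())

   w_gate = rng.normal(size=(dim, num_experts))
   tokens = rng.normal(size=(64, dim))
   chosen, gates = route(tokens, w_gate)
   update_bias(chosen)
   print("per-expert load:", np.bincount(chosen.ravel(), minlength=num_experts))

Because the bias in this sketch only reorders expert selection and never enters the training
loss, load balancing is encouraged without an explicit auxiliary loss term that could degrade
model quality.
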
In order to achieve efficient training, we support FP8 mixed precision training and
implement comprehensive optimizations for the training framework. Low-precision training
has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al.,
2019; Narang et al., 2017; Peng et al., 2023b), with its evolution closely tied to advancements in
hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this
work, we introduce an FP8 mixed precision training framework and, for the first time, validate
its effectiveness on an extremely large-scale model. Through the support for FP8 computation
and storage, we achieve both accelerated training and reduced GPU memory usage. As for
the training framework, we design the DualPipe algorithm for efficient pipeline parallelism,
which has fewer pipeline bubbles and hides most of the communication during training through
computation-communication overlap. This overlap ensures that, as the model further scales up,
as long as we maintain a constant computation-to-communication ratio, we can still employ
fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.
In addition, we develop efficient cross-node all-to-all communication kernels to fully utilize
InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory
footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism.
Combining these efforts, we achieve high training efficiency.

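For a rough sense of what FP8 storage involves (the actual mixed precision framework and its
scaling scheme are described in Section 3), the sketch below simulates a per-tensor FP8 E4M3
cast in NumPy: values are rescaled into the E4M3 range, rounded to a 3-bit mantissa, and later
dequantized with the stored scale. The scaling recipe and rounding model are simplifications
for illustration only, not the training framework's actual quantization.

.. code-block:: python

   import numpy as np

   E4M3_MAX = 448.0   # largest finite value in the FP8 E4M3 format

   def quantize_e4m3(x):
       """Per-tensor scaling plus a simulated E4M3 cast (3-bit mantissa).
       Returns the quantized payload and the scale needed to dequantize."""
       scale = max(float(np.abs(x).max()), 1e-12) / E4M3_MAX   # map tensor into FP8 range
       scaled = x / scale
       exp = np.floor(np.log2(np.maximum(np.abs(scaled), 2.0 ** -9)))
       step = 2.0 ** exp / 8.0            # spacing of representable values near each entry
       return np.round(scaled / step) * step, scale

   def dequantize(q, scale):
       return q * scale

   rng = np.random.default_rng(0)
   w = rng.normal(scale=0.02, size=4096).astype(np.float32)
   q, s = quantize_e4m3(w)
   print("max abs reconstruction error:", float(np.abs(w - dequantize(q, s)).max()))

The memory saving comes from keeping the quantized payload in one byte per value instead of
two or four; the accelerated training reported above additionally relies on performing the
computation itself in FP8.
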
During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The
pre-training process is remarkably stable. Throughout the entire training process, we did not
encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage
context length extension for DeepSeek-V3. In the first stage, the maximum context length is
extended to 32K, and in the second stage, it is further extended to 128K. Following this, we
conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)
on the base model of DeepSeek-V3, to align it with human preferences and further unlock its
potential. During the post-training stage, we distill the reasoning capability from the
DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model
accuracy and generation length.

.. raw:: html

   <div align="center">
   <table border="1" cellpadding="0" cellspacing="0">
     <tbody>
       <tr>
         <th align="left">Training Costs</th>
         <th>Pre-Training</th>
         <th>Context Extension</th>
         <th>Post-Training</th>
         <th>Total</th>
       </tr>
       <tr>
         <td align="left">in H800 GPU Hours</td>
         <td align="center">2664K</td>
         <td align="center">119K</td>
         <td align="center">5K</td>
         <td align="center">2788K</td>
       </tr>
       <tr>
         <td align="left">in USD</td>
         <td align="center">$5.328M</td>
         <td align="center">$0.238M</td>
         <td align="center">$0.01M</td>
         <td align="center">$5.576M</td>
       </tr>
     </tbody>
   </table>
   Table 1 | Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.
   </div>
   <br>

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical
training costs, comprehensive evaluations reveal that DeepSeek-V3-Base has emerged as the
strongest open-source base model currently available, especially in code and math. Its chat
version also outperforms other open-source models and achieves performance comparable to
leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard
and open-ended benchmarks.

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in
Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware.
During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K
H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our
pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined
with 119K GPU hours for the context length extension and 5K GPU hours for post-training,
DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of
the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that
the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs
associated with prior research and ablation experiments on architectures, algorithms, or data.

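These figures, and those in Table 1, follow directly from the stated assumptions (180K H800
GPU hours per trillion tokens, 14.8T pre-training tokens, a 2048-GPU cluster, and a $2 per GPU
hour rental price), as the short re-derivation below shows.

.. code-block:: python

   # Re-derive the training cost figures quoted above and in Table 1.
   gpu_hours_per_trillion = 180_000   # H800 GPU hours per trillion pre-training tokens
   tokens_in_trillions = 14.8         # pre-training corpus size
   cluster_gpus = 2048
   usd_per_gpu_hour = 2.0             # assumed H800 rental price

   pretrain_hours = gpu_hours_per_trillion * tokens_in_trillions       # 2,664,000 (2664K)
   days_per_trillion = gpu_hours_per_trillion / (cluster_gpus * 24)    # ~3.7 days
   total_hours = pretrain_hours + 119_000 + 5_000                      # + context ext. + post-training
   total_cost = total_hours * usd_per_gpu_hour                         # $5.576M

   print(f"{pretrain_hours / 1e3:.0f}K pre-training GPU hours, "
         f"{days_per_trillion:.1f} days per trillion tokens")
   print(f"{total_hours / 1e6:.3f}M total GPU hours -> ${total_cost / 1e6:.3f}M")
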
Our main contributions include:

**Architecture: Innovative Load Balancing Strategy and Training Objective**

* On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free
  strategy for load balancing, which minimizes the performance degradation that arises
  from encouraging load balancing.
* We investigate a Multi-Token Prediction (MTP) objective and prove it beneficial to model
  performance. It can also be used for speculative decoding for inference acceleration.

**Pre-Training: Towards Ultimate Training Efficiency**

* We design an FP8 mixed precision training framework and, for the first time, validate the
  feasibility and effectiveness of FP8 training on an extremely large-scale model.
* Through the co-design of algorithms, frameworks, and hardware, we overcome the
  communication bottleneck in cross-node MoE training, achieving near-full
  computation-communication overlap. This significantly enhances our training efficiency
  and reduces the training costs, enabling us to further scale up the model size without
  additional overhead.
* At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of
  DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model.
  The subsequent training stages after pre-training require only 0.1M GPU hours.

**Post-Training: Knowledge Distillation from DeepSeek-R1**

* We introduce an innovative methodology to distill reasoning capabilities from the
  long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek-R1 series
  models, into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates
  the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its
  reasoning performance. Meanwhile, we also maintain control over the output style and
  length of DeepSeek-V3.

**Summary of Core Evaluation Results**

* **Knowledge:** (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA,
  DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9
  on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source
  models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source and
  closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3
  demonstrates superior performance among open-source models on both SimpleQA and
  Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual
  knowledge (SimpleQA), it surpasses these models in Chinese factual knowledge (Chinese
  SimpleQA), highlighting its strength in Chinese factual knowledge.
* **Code, Math, and Reasoning:** (1) DeepSeek-V3 achieves state-of-the-art performance on
  math-related benchmarks among all non-long-CoT open-source and closed-source models.
  Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500,
  demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks,
  DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks,
  such as LiveCodeBench, solidifying its position as the leading model in this domain. For
  engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5,
  it still outpaces all other models by a significant margin, demonstrating its competitiveness
  across diverse technical benchmarks.

In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3
model architecture (Section 2). Subsequently, we introduce our infrastructures, encompassing
our compute clusters, the training framework, the support for FP8 training, the inference
deployment strategy, and our suggestions on future hardware design (Section 3). Next, we
describe our pre-training process, including the construction of training data, hyper-parameter
settings, long-context extension techniques, the associated evaluations, as well as some
discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include
Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations,
and discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of
DeepSeek-V3, and propose potential directions for future research (Section 6).