1. Introduction
===============

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and
evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards
Artificial General Intelligence (AGI). Beyond closed-source models, open-source models,
including the DeepSeek series (DeepSeek-AI, 2024a,b,c; Guo et al., 2024), the LLaMA series (AI@Meta,
2024a,b; Touvron et al., 2023a,b), the Qwen series (Qwen, 2023, 2024a,b), and the Mistral series (Jiang
et al., 2023; Mistral, 2024), are also making significant strides, endeavoring to close the gap with
their closed-source counterparts. To further push the boundaries of open-source model capabilities,
we scale up our models and introduce DeepSeek-V3, a large Mixture-of-Experts (MoE)
model with 671B parameters, of which 37B are activated for each token.

With a forward-looking perspective, we consistently strive for strong model performance
and economical costs. Therefore, in terms of architecture, DeepSeek-V3 still adopts Multi-head
Latent Attention (MLA) (DeepSeek-AI, 2024c) for efficient inference and DeepSeekMoE (Dai
et al., 2024) for cost-effective training. These two architectures have been validated in
DeepSeek-V2 (DeepSeek-AI, 2024c), demonstrating their capability to maintain robust model performance
while achieving efficient training and inference. Beyond the basic architecture, we implement
two additional strategies to further enhance the model capabilities. Firstly, DeepSeek-V3 pioneers
an auxiliary-loss-free strategy (Wang et al., 2024a) for load balancing, with the aim of
minimizing the adverse impact on model performance that arises from the effort to encourage
load balancing. Secondly, DeepSeek-V3 employs a multi-token prediction training objective,
which we have observed to enhance the overall performance on evaluation benchmarks.

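The concrete load-balancing mechanism is described later in the paper; purely as an illustration
of the idea, the sketch below shows how a per-expert bias can steer top-k expert selection without
adding an auxiliary loss term to the training objective. The function names, the sign-based update
rule, and the fixed update speed ``gamma`` are assumptions made for this sketch, not the exact
implementation.

.. code-block:: python

   import torch

   def bias_adjusted_topk(scores, bias, k):
       """Pick the top-k experts from bias-adjusted scores (illustrative sketch).

       The bias only affects *which* experts are selected; the gating weights
       applied to expert outputs still come from the raw scores, so no
       auxiliary loss term is added to the main training objective.
       """
       _, topk_idx = torch.topk(scores + bias, k, dim=-1)    # (tokens, k)
       topk_w = torch.gather(scores, -1, topk_idx)
       topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)    # normalize weights
       return topk_idx, topk_w

   def update_bias(bias, expert_load, gamma=1e-3):
       """After each step, nudge overloaded experts down and underloaded ones up."""
       mean_load = expert_load.float().mean()
       return bias + gamma * torch.sign(mean_load - expert_load.float())

Because the bias enters only the selection step, the balancing pressure does not add a gradient
term to the language-modeling loss, which is the motivation stated above.
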
In order to achieve efficient training, we support FP8 mixed precision training and
implement comprehensive optimizations for the training framework. Low-precision training
has emerged as a promising solution for efficient training (Dettmers et al., 2022; Kalamkar et al.,
2019; Narang et al., 2017; Peng et al., 2023b), with its evolution closely tied to advancements in
hardware capabilities (Luo et al., 2024; Micikevicius et al., 2022; Rouhani et al., 2023a). In this
work, we introduce an FP8 mixed precision training framework and, for the first time, validate
its effectiveness on an extremely large-scale model. Through the support for FP8 computation
and storage, we achieve both accelerated training and reduced GPU memory usage. As for
the training framework, we design the DualPipe algorithm for efficient pipeline parallelism,
which has fewer pipeline bubbles and hides most of the communication during training through
computation-communication overlap. This overlap ensures that, as the model further scales up,
as long as we maintain a constant computation-to-communication ratio, we can still employ
fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.
In addition, we develop efficient cross-node all-to-all communication kernels to fully utilize
InfiniBand (IB) and NVLink bandwidths. Furthermore, we meticulously optimize the memory
footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism.
Combining these efforts, we achieve high training efficiency.

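The FP8 framework itself is presented later in the paper; as a minimal sketch of what FP8 storage
means in practice, the snippet below quantizes a tensor block-wise into ``torch.float8_e4m3fn``,
keeping per-block scales in higher precision for dequantization. The block size of 128 and the
helper names are assumptions made for illustration, not the framework's actual interface.

.. code-block:: python

   import torch

   FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for the E4M3 format

   def fp8_quantize(x: torch.Tensor, block: int = 128):
       """Block-wise FP8 quantization sketch: scale each block so that its
       absolute maximum fits the E4M3 range, then cast to FP8."""
       x = x.reshape(-1, block)  # assumes x.numel() is divisible by `block`
       scale = x.abs().amax(dim=-1, keepdim=True).clamp_(min=1e-4) / FP8_MAX
       return (x / scale).to(torch.float8_e4m3fn), scale

   def fp8_dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
       """Recover a higher-precision tensor from the FP8 payload and its scales."""
       return x_fp8.to(torch.bfloat16) * scale

Storing one byte per value plus a small number of scales is what yields the reduced GPU memory
usage mentioned above.
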
During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The
pre-training process is remarkably stable. Throughout the entire training process, we did not
encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage
context length extension for DeepSeek-V3. In the first stage, the maximum context length is
extended to 32K, and in the second stage, it is further extended to 128K. Following this, we
conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL)
on the base model of DeepSeek-V3, to align it with human preferences and further unlock its
potential. During the post-training stage, we distill the reasoning capability from the
DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy
and generation length.

.. raw:: html

   <div align="center">
   <table border="1" cellpadding="0" cellspacing="0">
     <tbody>
       <tr>
         <th align="left">Training Costs</th>
         <th>Pre-Training</th>
         <th>Context Extension</th>
         <th>Post-Training</th>
         <th>Total</th>
       </tr>
       <tr>
         <td align="left">in H800 GPU Hours</td>
         <td align="center">2664K</td>
         <td align="center">119K</td>
         <td align="center">5K</td>
         <td align="center">2788K</td>
       </tr>
       <tr>
         <td align="left">in USD</td>
         <td align="center">$5.328M</td>
         <td align="center">$0.238M</td>
         <td align="center">$0.01M</td>
         <td align="center">$5.576M</td>
       </tr>
     </tbody>
   </table>
   Table 1 | Training costs of DeepSeek-V3, assuming the rental price of H800 is $2 per GPU hour.
   </div>
   <br>

We evaluate DeepSeek-V3 on a comprehensive array of benchmarks. Despite its economical
training costs, the evaluations reveal that DeepSeek-V3-Base has emerged as the
strongest open-source base model currently available, especially in code and math. Its chat
version also outperforms other open-source models and achieves performance comparable to
leading closed-source models, including GPT-4o and Claude-3.5-Sonnet, on a series of standard
and open-ended benchmarks.

Lastly, we emphasize again the economical training costs of DeepSeek-V3, summarized in
Table 1, achieved through our optimized co-design of algorithms, frameworks, and hardware.
During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K
H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our
pre-training stage is completed in less than two months and costs 2664K GPU hours. Combined
with 119K GPU hours for the context length extension and 5K GPU hours for post-training,
DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of
the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M. Note that
the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs
associated with prior research and ablation experiments on architectures, algorithms, or data.

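The totals in Table 1 follow directly from the per-trillion-token figure quoted above; a quick
back-of-the-envelope check:

.. code-block:: python

   # Back-of-the-envelope check of the Table 1 figures.
   gpu_hours_per_trillion = 180_000       # H800 GPU hours per trillion tokens
   cluster_gpus = 2048
   tokens_in_trillions = 14.8
   usd_per_gpu_hour = 2.0                 # assumed H800 rental price

   days_per_trillion = gpu_hours_per_trillion / cluster_gpus / 24
   pre_training = gpu_hours_per_trillion * tokens_in_trillions      # 2,664,000
   total = pre_training + 119_000 + 5_000                           # 2,788,000

   print(f"{days_per_trillion:.1f} days per trillion tokens")       # ~3.7
   print(f"{total / 1e6:.3f}M GPU hours, ${total * usd_per_gpu_hour / 1e6:.3f}M")
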
Our main contributions include:

**Architecture: Innovative Load Balancing Strategy and Training Objective**

* On top of the efficient architecture of DeepSeek-V2, we pioneer an auxiliary-loss-free
  strategy for load balancing, which minimizes the performance degradation that arises
  from encouraging load balancing.
* We investigate a Multi-Token Prediction (MTP) objective and show that it is beneficial to
  model performance. It can also be used for speculative decoding to accelerate inference (see
  the sketch below).

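How the MTP modules are used at inference time is discussed later in the paper; as a rough,
framework-agnostic sketch only, the snippet below shows the greedy accept/reject rule at the core
of speculative decoding, where draft tokens (for example, proposed by an MTP head) are kept only
as long as they match the main model's own predictions. The callables and names here are
placeholders, not DeepSeek-V3 APIs.

.. code-block:: python

   from typing import Callable, List

   def verify_draft(main_next_token: Callable[[List[int]], int],
                    prefix: List[int],
                    draft: List[int]) -> List[int]:
       """Greedy speculative-decoding verification (conceptual sketch only).

       In practice, all draft positions are scored in a single batched forward
       pass of the main model; the sequential loop here only illustrates the
       accept/reject rule.
       """
       accepted: List[int] = []
       for token in draft:
           predicted = main_next_token(prefix + accepted)
           if predicted != token:
               accepted.append(predicted)  # main model disagrees: keep its token, stop
               break
           accepted.append(token)          # draft token confirmed
       return accepted
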
**Pre-Training: Towards Ultimate Training Efficiency**

* We design an FP8 mixed precision training framework and, for the first time, validate the
  feasibility and effectiveness of FP8 training on an extremely large-scale model.
* Through the co-design of algorithms, frameworks, and hardware, we overcome the
  communication bottleneck in cross-node MoE training, achieving near-full
  computation-communication overlap. This significantly enhances our training efficiency and
  reduces the training costs, enabling us to further scale up the model size without additional
  overhead.
* At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of
  DeepSeek-V3 on 14.8T tokens, producing the currently strongest open-source base model.
  The subsequent training stages after pre-training require only 0.1M GPU hours.

**Post-Training: Knowledge Distillation from DeepSeek-R1**

* We introduce an innovative methodology to distill reasoning capabilities from the
  long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models,
  into standard LLMs, particularly DeepSeek-V3. Our pipeline elegantly incorporates the
  verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its
  reasoning performance. Meanwhile, we also maintain control over the output style and
  length of DeepSeek-V3.

**Summary of Core Evaluation Results**

* **Knowledge:** (1) On educational benchmarks such as MMLU, MMLU-Pro, and GPQA,
  DeepSeek-V3 outperforms all other open-source models, achieving 88.5 on MMLU, 75.9
  on MMLU-Pro, and 59.1 on GPQA. Its performance is comparable to leading closed-source
  models like GPT-4o and Claude-Sonnet-3.5, narrowing the gap between open-source
  and closed-source models in this domain. (2) For factuality benchmarks, DeepSeek-V3
  demonstrates superior performance among open-source models on both SimpleQA and
  Chinese SimpleQA. While it trails behind GPT-4o and Claude-Sonnet-3.5 in English factual
  knowledge (SimpleQA), it surpasses these models on Chinese SimpleQA, highlighting its
  strength in Chinese factual knowledge.
* **Code, Math, and Reasoning:** (1) DeepSeek-V3 achieves state-of-the-art performance on
  math-related benchmarks among all non-long-CoT open-source and closed-source models.
  Notably, it even outperforms o1-preview on specific benchmarks, such as MATH-500,
  demonstrating its robust mathematical reasoning capabilities. (2) On coding-related tasks,
  DeepSeek-V3 emerges as the top-performing model for coding competition benchmarks,
  such as LiveCodeBench, solidifying its position as the leading model in this domain. For
  engineering-related tasks, while DeepSeek-V3 performs slightly below Claude-Sonnet-3.5,
  it still outpaces all other models by a significant margin, demonstrating its competitiveness
  across diverse technical benchmarks.

In the remainder of this paper, we first present a detailed exposition of our DeepSeek-V3
model architecture (Section 2). Subsequently, we introduce our infrastructure, encompassing
our compute clusters, the training framework, the support for FP8 training, the inference
deployment strategy, and our suggestions on future hardware design (Section 3). Next, we
describe our pre-training process, including the construction of training data, hyper-parameter
settings, long-context extension techniques, the associated evaluations, as well as some
discussions (Section 4). Thereafter, we discuss our efforts on post-training, which include
Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the corresponding evaluations, and
discussions (Section 5). Lastly, we conclude this work, discuss existing limitations of
DeepSeek-V3, and propose potential directions for future research (Section 6).