DeepSeek-V3
Table of Contents
- Introduction
- Model Overview
- Model Downloads
- Performance Benchmarks
- Chat & API Access
- Running Locally
- Licensing
- Citation
- Contact Us
1. Introduction
DeepSeek-V3 is a state-of-the-art Mixture-of-Experts (MoE) language model with 671 billion total parameters, activating 37 billion parameters per token. Building on the efficient architecture of DeepSeek-V2, it introduces cutting-edge innovations, including Multi-head Latent Attention (MLA), DeepSeekMoE, an auxiliary-loss-free load balancing strategy, and a Multi-Token Prediction (MTP) training objective. These advancements deliver exceptional performance and scalability.
Pre-trained on 14.8 trillion high-quality, diverse tokens, followed by Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), DeepSeek-V3 achieves top-tier performance, surpassing other open-source models and rivaling leading closed-source models. Remarkably, its full training required only 2.788 million H800 GPU hours, with a stable process free of loss spikes or rollbacks.
2. Model Overview
Architecture: Optimized for Efficiency
- Innovative Load Balancing: An auxiliary-loss-free strategy keeps expert load balanced without the performance penalty that auxiliary balancing losses typically incur (see the sketch after this list).
- Multi-Token Prediction (MTP): Enhances model performance and supports speculative decoding for faster inference.
- DeepSeekMoE & MLA: Leverages the proven efficiency of DeepSeek-V2’s architecture for large-scale MoE models.
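To make the auxiliary-loss-free idea concrete, here is a minimal PyTorch sketch of bias-adjusted top-k routing. The function names, the softmax gating, and the sign-based bias update are illustrative assumptions, not DeepSeek's actual implementation; see the technical report for the exact formulation.

```python
import torch

def route_tokens(scores: torch.Tensor, bias: torch.Tensor, k: int):
    # The bias only influences WHICH experts are selected; the gating weights
    # that mix expert outputs are computed from the raw affinity scores.
    _, topk_idx = torch.topk(scores + bias, k, dim=-1)          # (tokens, k)
    gate = torch.gather(scores, -1, topk_idx).softmax(dim=-1)   # (tokens, k)
    return topk_idx, gate

def update_bias(bias: torch.Tensor, load: torch.Tensor, gamma: float = 1e-3):
    # After each step, push biases of overloaded experts down and underloaded
    # experts up, so load evens out without an auxiliary loss term.
    return bias - gamma * torch.sign(load.float() - load.float().mean())

# Toy usage: 4 tokens routed over 8 experts, picking 2 experts per token.
scores = torch.rand(4, 8)
bias = torch.zeros(8)
idx, gate = route_tokens(scores, bias, k=2)
load = torch.bincount(idx.flatten(), minlength=8)
bias = update_bias(bias, load)
```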
Pre-Training: Unprecedented Efficiency
- FP8 Mixed Precision: Validates the feasibility of FP8 training for extremely large models, reducing memory and computational overhead (a toy quantization sketch follows this list).
- Optimized Communication: Overcomes cross-node MoE training bottlenecks, achieving near-complete computation-communication overlap.
- Cost-Effective Scaling: Pre-trained on 14.8T tokens using only 2.664M H800 GPU hours, with post-training requiring just 0.1M GPU hours.
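As a toy illustration of the FP8 idea, the sketch below quantizes a weight tensor with a single per-tensor scale. This is a simplification: the production recipe uses fine-grained block-wise scaling and FP8 GEMM kernels, and the snippet requires a PyTorch build with float8 support.

```python
import torch

def quantize_fp8(w: torch.Tensor):
    """Store a weight in float8_e4m3fn plus one inverse scale (illustrative)."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max      # 448.0 for e4m3
    scale = fp8_max / w.abs().max().clamp(min=1e-12)    # map amax onto fp8 range
    w_fp8 = (w * scale).to(torch.float8_e4m3fn)
    return w_fp8, scale.reciprocal()                    # keep scale_inv for dequant

def dequantize_to_bf16(w_fp8: torch.Tensor, scale_inv: torch.Tensor):
    # Upcast, multiply the inverse scale back in, then round to BF16.
    return (w_fp8.to(torch.float32) * scale_inv).to(torch.bfloat16)
```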
Post-Training: Advanced Reasoning
- Knowledge Distillation: Integrates reasoning capabilities from DeepSeek-R1’s long-Chain-of-Thought (CoT) model, enhancing DeepSeek-V3’s reasoning while maintaining control over output style and length.
3. 🚀 Model Downloads
Access DeepSeek-V3 models, pre-trained and fine-tuned for exceptional performance:
Model Name | Total Parameters | Activated Parameters | Context Length | Download |
---|---|---|---|---|
DeepSeek-V3-Base | 671B | 37B | 128K | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V3-Base) |
DeepSeek-V3 | 671B | 37B | 128K | [🤗 Hugging Face](https://huggingface.co/deepseek-ai/DeepSeek-V3) |
Note: The total model size is 685B, comprising 671B main model weights and 14B Multi-Token Prediction (MTP) module weights.
For detailed instructions on running the model locally, see Section 6: Running Locally. Developers can explore README_WEIGHTS.md for insights into model weights and MTP modules. Community contributions to MTP support are welcome!
4. Performance Benchmarks
Base Model: Standard Benchmarks
DeepSeek-V3 excels across a wide range of tasks, particularly in math and code:
Category | Benchmark (Metric) | # Shots | DeepSeek-V2 | Qwen2.5 72B | LLaMA3.1 405B | DeepSeek-V3 |
---|---|---|---|---|---|---|
Architecture | - | - | MoE | Dense | Dense | MoE |
Activated Params | - | - | 21B | 72B | 405B | 37B |
Total Params | - | - | 236B | 72B | 405B | 671B |
English | Pile-test (BPB) | - | 0.606 | 0.638 | **0.542** | **0.548** |
 | BBH (EM) | 3-shot | 78.8 | 79.8 | 82.9 | **87.5** |
 | MMLU (Acc.) | 5-shot | 78.4 | 85.0 | 84.4 | **87.1** |
 | MMLU-Pro (Acc.) | 5-shot | 51.4 | 58.3 | 52.8 | **64.4** |
 | DROP (F1) | 3-shot | 80.4 | 80.6 | 86.0 | **89.0** |
Code | HumanEval (Pass@1) | 0-shot | 43.3 | 53.0 | 54.9 | **65.2** |
 | MBPP (Pass@1) | 3-shot | 65.0 | 72.6 | 68.4 | **75.4** |
 | LiveCodeBench-Base (Pass@1) | 3-shot | 11.6 | 12.9 | 15.5 | **19.4** |
Math | GSM8K (EM) | 8-shot | 81.6 | 88.3 | 83.5 | **89.3** |
 | MATH (EM) | 4-shot | 43.4 | 54.4 | 49.0 | **61.6** |
 | MGSM (EM) | 8-shot | 63.6 | 76.2 | 69.9 | **79.8** |
Chinese | C-Eval (Acc.) | 5-shot | 81.4 | 89.2 | 72.5 | **90.1** |
 | CMMLU (Acc.) | 5-shot | 84.0 | **89.5** | 73.7 | 88.8 |
Multilingual | MMMLU-non-English (Acc.) | 5-shot | 64.0 | 74.8 | 73.8 | **79.4** |
Note: Bold indicates the best results; scores within 0.3 points are treated as equivalent. For detailed results, refer to the technical report.
Context Window
DeepSeek-V3 supports a 128K context window, performing robustly in Needle In A Haystack (NIAH) tests across all lengths.
Chat Model: Competitive with Frontier Models
DeepSeek-V3’s chat model rivals leading closed-source models:
Benchmark (Metric) | DeepSeek-V2.5 | Qwen2.5 72B | LLaMA3.1 405B | Claude-3.5 | GPT-4o | DeepSeek-V3 |
---|---|---|---|---|---|---|
MMLU (EM) | 80.6 | 85.3 | 88.6 | 88.3 | 87.2 | 88.5 |
MMLU-Pro (EM) | 66.2 | 71.6 | 73.3 | 78.0 | 72.6 | 75.9 |
DROP (3-shot F1) | 87.8 | 76.7 | 88.7 | 88.3 | 83.7 | 91.6 |
HumanEval-Mul (Pass@1) | 77.4 | 77.3 | 77.2 | 81.7 | 80.5 | 82.6 |
MATH-500 (EM) | 74.7 | 80.0 | 73.8 | 78.3 | 74.6 | 90.2 |
AIME 2024 (Pass@1) | 16.7 | 23.3 | 23.3 | 16.0 | 9.3 | 39.2 |
Open-Ended Generation
DeepSeek-V3 excels in conversational tasks, outperforming other open-source models:
Model | Arena-Hard | AlpacaEval 2.0 |
---|---|---|
DeepSeek-V2.5 | 76.2 | 50.5 |
Qwen2.5-72B | 81.2 | 49.1 |
LLaMA3.1-405B | 69.3 | 40.5 |
GPT-4o | 80.4 | 51.1 |
Claude-Sonnet-3.5 | 85.2 | 52.0 |
DeepSeek-V3 | 85.5 | 70.0 |
Note: AlpacaEval 2.0 results use the length-controlled win rate.
5. Chat & API Access
- Chat with DeepSeek-V3: Try it on our official platform: chat.deepseek.com.
- API Access: Integrate DeepSeek-V3 via our OpenAI-compatible API at platform.deepseek.com (example below).
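A minimal request through the OpenAI-compatible endpoint, using the official openai Python client. The base URL and the deepseek-chat model name follow the platform documentation; verify both there before use.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",   # issued at platform.deepseek.com
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-chat",  # serves DeepSeek-V3
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Mixture-of-Experts in one sentence."},
    ],
)
print(response.choices[0].message.content)
```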
6. Running Locally
DeepSeek-V3 can be deployed locally using a variety of frameworks and hardware configurations. Below are the supported options:
Supported Frameworks
- DeepSeek-Infer Demo: Lightweight demo for FP8 and BF16 inference.
- SGLang: Supports FP8/BF16 with MLA optimizations and multi-node tensor parallelism; MTP support is in progress (details). A sample launch command follows this list.
- LMDeploy: Efficient FP8/BF16 inference for local and cloud deployment (instructions).
- TensorRT-LLM: Supports BF16 and INT4/8 quantization; FP8 support coming soon (custom branch).
- vLLM: Supports FP8/BF16 with pipeline parallelism (instructions).
- LightLLM: Single- and multi-node deployment for FP8/BF16 (instructions).
- AMD GPU: Full FP8/BF16 support via SGLang.
- Huawei Ascend NPU: BF16 support via MindIE (instructions).
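For instance, a representative SGLang server launch on an 8-GPU node might look like the following. Treat this as a sketch: exact flag names can vary between SGLang releases, so check the linked instructions.

```
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code
```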
Converting FP8 to BF16
DeepSeek-V3 uses FP8 weights by default. To convert to BF16:
cd inference
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights
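Conceptually, the conversion multiplies each FP8 weight by its stored inverse scale (README_WEIGHTS.md documents the per-block weight_scale_inv tensors). Below is a rough, illustrative sketch of the block-wise dequantization, assuming 128×128 blocks and hypothetical names; the actual script also handles safetensors shards and checkpoint layout.

```python
import torch

def dequant_block_fp8(w_fp8: torch.Tensor, scale_inv: torch.Tensor,
                      block: int = 128) -> torch.Tensor:
    """Multiply each 128x128 weight block by its stored inverse scale."""
    # Expand per-block scales to element granularity, then trim edge padding.
    scales = scale_inv.repeat_interleave(block, 0).repeat_interleave(block, 1)
    scales = scales[: w_fp8.shape[0], : w_fp8.shape[1]]
    return (w_fp8.to(torch.float32) * scales).to(torch.bfloat16)
```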
Example: DeepSeek-Infer Demo
System Requirements
- OS: Linux with Python 3.10 (Mac/Windows not supported).
- Dependencies:
torch==2.4.1
triton==3.0.0
transformers==4.46.3
safetensors==0.4.5
Setup
- Clone the repository and install dependencies:
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
pip install -r requirements.txt
- Download the model weights from Hugging Face and place them in /path/to/DeepSeek-V3.
- Convert the weights for the demo:
python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16
- Run interactive chat (in this two-node, 16-GPU example, $RANK is each node's rank and $ADDR is the master node's address; set them to match your cluster):
torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200
Note: Hugging Face Transformers support is under development.
7. Licensing
- Code: Licensed under the MIT License.
- Model: Governed by the DeepSeek Model License, supporting commercial use.
8. Citation
@misc{deepseekai2024deepseekv3technicalreport,
title={DeepSeek-V3 Technical Report},
author={DeepSeek-AI},
year={2024},
eprint={2412.19437},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.19437},
}
9. Contact Us
For questions, feedback, or support, please:
- Raise an issue on GitHub.
- Email us at service@deepseek.com.