# DeepSeek-V3 Weight File Documentation
## New Fields in `config.json`

- `model_type`: Specifies the model type, which is set to `deepseek_v3` in this release.
- `num_nextn_predict_layers`: Defines the number of Multi-Token Prediction (MTP) Modules. The open-sourced V3 weights contain 1 MTP Module.
- `quantization_config`: Details the configuration for FP8 quantization; a short snippet that reads these fields follows this list.
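
For reference, here is a minimal Python snippet that reads these fields from `config.json` (the checkpoint path below is hypothetical):

```python
import json

# Hypothetical local path to the downloaded checkpoint's config.json.
CONFIG_PATH = "DeepSeek-V3/config.json"

with open(CONFIG_PATH) as f:
    config = json.load(f)

print(config.get("model_type"))                # expected: "deepseek_v3"
print(config.get("num_nextn_predict_layers"))  # 1 for the open-sourced V3 weights
print(config.get("quantization_config"))       # FP8 settings, documented below
```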
## Weight File Structure Overview
The DeepSeek-V3 weight file is divided into two primary components: Main Model Weights and MTP Modules.
### 1. Main Model Weights

- Composition:
  - Includes the input/output embedding layers and a full set of 61 Transformer hidden layers.
- Parameter Count:
  - Total parameters: 671B
  - Activation parameters: 36.7B (including 0.9B for Embedding and 0.9B for the Output Head).
#### Structural Details

- Embedding Layer:
  - `model.embed_tokens.weight`
- Transformer Hidden Layers:
  - `model.layers.0` to `model.layers.60`, corresponding to `num_hidden_layers` layers.
- Output Layer:
  - `model.norm.weight`
  - `lm_head.weight`
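
As an illustration of how these names compose from `config.json`, here is a minimal sketch (assuming `num_hidden_layers = 61`, as in the released config; per-layer sub-module names such as attention and MLP projections are omitted):

```python
# Minimal sketch: enumerate the documented main-model weight name prefixes.
# Per-layer sub-module names (attention, MLP, experts, ...) are omitted.
num_hidden_layers = 61  # from config.json

main_model_prefixes = (
    ["model.embed_tokens.weight"]
    + [f"model.layers.{i}" for i in range(num_hidden_layers)]  # model.layers.0 ... model.layers.60
    + ["model.norm.weight", "lm_head.weight"]
)

print(len(main_model_prefixes))                          # 64 entries: embedding + 61 layers + norm + head
print(main_model_prefixes[1], main_model_prefixes[-3])   # model.layers.0  model.layers.60
```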
### 2. Multi-Token Prediction (MTP) Modules

- Composition:
  - These modules are determined by the `num_nextn_predict_layers` parameter. In this model, the value is set to 1.
- Parameter Count:
  - Parameters: 11.5B unique parameters (excluding the shared 0.9B Embedding and 0.9B Output Head).
  - Activation parameters: 2.4B (including the shared 0.9B Embedding and 0.9B Output Head).
#### Structural Details

- `embed_tokens`: Shares parameters with the Main Model’s Embedding layer.
- `enorm` & `hnorm`: RMSNorm parameters used for speculative decoding.
- `eh_proj`: Parameters used for dimensionality reduction of the normalized outputs.
- Additional Transformer Hidden Layer:
  - `model.layers.61.self_attn & mlp` (structured the same as the Main Model hidden layers).
- `shared_head`: Shares parameters with the Output Head of the Main Model.
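
For orientation, here is a sketch of the corresponding tensor-name patterns. Note that only the `model.layers.61.self_attn & mlp` path is spelled out above, so the exact prefixes used for the remaining components are an assumption for illustration:

```python
# Sketch of the MTP Module's tensor-name patterns. Only the self_attn/mlp path
# is given explicitly in the documentation; the other prefixes are assumptions.
mtp_layer_id = 61  # appended directly after the 61 main-model hidden layers

mtp_prefixes = [
    f"model.layers.{mtp_layer_id}.embed_tokens",  # shared with the Main Model's Embedding layer
    f"model.layers.{mtp_layer_id}.enorm",         # RMSNorm used for speculative decoding
    f"model.layers.{mtp_layer_id}.hnorm",
    f"model.layers.{mtp_layer_id}.eh_proj",       # dimensionality-reduction projection
    f"model.layers.{mtp_layer_id}.self_attn",     # same structure as a Main Model hidden layer
    f"model.layers.{mtp_layer_id}.mlp",
    f"model.layers.{mtp_layer_id}.shared_head",   # shared with the Main Model's Output Head
]
print("\n".join(mtp_prefixes))
```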
### Layer Loading Rules

- Main Model Weights: These are loaded according to the `num_hidden_layers` field in `config.json`.
- MTP Modules: These are loaded using the `num_nextn_predict_layers` field, with MTP layer IDs appended directly after the Main Model’s hidden layers. For example (also expressed as a short snippet below):
  - With `num_hidden_layers = 61` and `num_nextn_predict_layers = 1`, the MTP Module layer ID will be `61`.
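
The rule amounts to a couple of lines of arithmetic; a minimal sketch using the released config values:

```python
# Layer-ID layout implied by the loading rules (values from the released config.json).
num_hidden_layers = 61
num_nextn_predict_layers = 1

main_layer_ids = list(range(num_hidden_layers))  # 0 .. 60 -> Main Model hidden layers
mtp_layer_ids = list(range(num_hidden_layers,
                           num_hidden_layers + num_nextn_predict_layers))  # [61] -> MTP Module

print(main_layer_ids[0], main_layer_ids[-1], mtp_layer_ids)  # 0 60 [61]
```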
## FP8 Weight Documentation
DeepSeek-V3 natively supports the FP8 weight format with 128x128 block scaling.
### FP8 Configuration

The FP8 weight file introduces a `quantization_config` field, which defines the quantization method. Below is an example of the configuration:

```json
"quantization_config": {
  "activation_scheme": "dynamic",
  "fmt": "e4m3",
  "quant_method": "fp8",
  "weight_block_size": [128, 128]
}
```
- Quantization Format:
  - Format type: `fp8` and `e4m3` (aligned with `torch.float8_e4m3fn`).
  - Weight block size: `128x128`.
- Activation Quantization Scheme:
  - Uses dynamic activation quantization (`dynamic`); a sketch of this scheme follows below.
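
Since no activation scales are stored in the checkpoint, "dynamic" means they are computed at runtime. Below is a minimal pure-PyTorch sketch of such quantization at the per-token, per-128-channel granularity described in the Dequantization Method section; the official implementation uses fused kernels, so this is only illustrative, and it assumes a PyTorch build with `torch.float8_e4m3fn` support.

```python
import torch

def dynamic_act_quant(x: torch.Tensor, block: int = 128):
    """Per-token, per-128-channel dynamic quantization to FP8 (e4m3). Illustrative sketch only."""
    assert x.shape[-1] % block == 0, "last dimension is assumed to be a multiple of 128"
    groups = x.reshape(*x.shape[:-1], -1, block)                       # (..., n_groups, 128)
    amax = groups.abs().amax(dim=-1, keepdim=True).float().clamp(min=1e-12)
    scale = amax / 448.0                                               # 448 = largest e4m3 value
    q = (groups / scale).to(torch.float8_e4m3fn)                       # quantized activations
    return q.reshape_as(x), scale.squeeze(-1)                          # one scale per token per 128-channel group

x = torch.randn(4, 256, dtype=torch.bfloat16)   # 4 tokens, 256 channels -> 2 groups of 128
x_q, x_scale = dynamic_act_quant(x)
print(x_q.dtype, x_scale.shape)                 # torch.float8_e4m3fn torch.Size([4, 2])
```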
### Dequantization Method

The FP8 weight file includes a `weight_scale_inv` field, which stores the dequantization scale for each weight block.

- Storage Format: Stored as a `float32` Tensor, alongside the weight data.
- Dequantization Formula:
  - If a weight block is not aligned to 128, it is zero-padded to 128 before calculating the scale. The padded portion is discarded after quantization.
  - Dequantization is performed as `(128x128 weight block) * weight_scale_inv`; a sketch follows at the end of this document.
This dequantization process enables runtime operations to apply online quantization on a per-token, per-128-channel basis.
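
Putting the formula above into code, here is a minimal pure-PyTorch sketch of block-wise weight dequantization. It assumes a 2-D weight whose `weight_scale_inv` holds one `float32` scale per 128x128 block; the official repository uses a fused Triton kernel instead, so treat this only as an illustration.

```python
import torch

def weight_dequant(w_fp8: torch.Tensor, weight_scale_inv: torch.Tensor,
                   block: int = 128) -> torch.Tensor:
    """Apply (128x128 weight block) * weight_scale_inv to recover higher-precision weights."""
    rows, cols = w_fp8.shape
    w = w_fp8.to(torch.float32)
    out = torch.empty_like(w)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            # Edge blocks may be smaller than 128x128: their scale was computed on a
            # zero-padded block, so multiplying the unpadded slice is still correct.
            out[i:i + block, j:j + block] = (
                w[i:i + block, j:j + block] * weight_scale_inv[i // block, j // block]
            )
    return out

# Toy example: a 256x384 FP8 weight with a 2x3 grid of per-block scales.
w_fp8 = torch.randn(256, 384).to(torch.float8_e4m3fn)
scale_inv = torch.rand(2, 3, dtype=torch.float32)
w_dequant = weight_dequant(w_fp8, scale_inv).to(torch.bfloat16)
print(w_dequant.shape, w_dequant.dtype)   # torch.Size([256, 384]) torch.bfloat16
```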