# DeepSeek-V3 Weight File Documentation

## BF16 SFT to DeepSeek block-scale weight quantization for inference
The vLLM community project `vllm-project/llm-compressor` is working on integrating block-scale quantization support ([PR#1475](https://github.com/vllm-project/llm-compressor/issues/1475)). vLLM and SGLang already support the DeepSeek FP8 dynamic (activation) quantization format.

The DeepSeek weight format ([README_WEIGHTS.md](../README_WEIGHTS.md)) requires weights to be quantized statically with a block-wise max-reduction kernel operator, and we find it handy to add BF16 quantization support on top of Hugging Face safetensors.

To produce an FP8 (OCP E4M3 format) quantized model, quite unlike Mixtral, we need to ignore `lm_head` while casting all 61 MoE up, down, and gate projection weights to FP8:
```
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",               # OCP E4M3 weights
    activation_scheme="dynamic",      # activations are quantized at runtime
    ignore_patterns=[".*lm_head"],    # keep the output head unquantized
)
```
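For context, this is roughly how such a config is consumed end to end. The sketch below assumes the AutoFP8-style `auto_fp8` API (which `llm-compressor` is superseding); the checkpoint paths are placeholders, not the paths used later in this document.

```
# Sketch only: assumes the AutoFP8-style API; adapt once llm-compressor lands block-scale support.
from auto_fp8 import AutoFP8ForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="dynamic",
    ignore_patterns=[".*lm_head"],
)

model = AutoFP8ForCausalLM.from_pretrained("path/to/bf16-checkpoint", quantize_config=quantize_config)
model.quantize([])  # the dynamic activation scheme needs no calibration samples
model.save_quantized("path/to/fp8-output")
```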
SGLang and vLLM already take care of dynamic activation quantization, which for activations is basically equivalent to 128-group (or 4 x E8M0 32-group) quantization. Block-scale quantization for the weights, however, requires tiling of the inputs and outputs along the non-K dimensions as well.
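As an illustration, here is a minimal PyTorch sketch of what per-token, 128-element-group dynamic activation quantization computes; the production kernels in vLLM and SGLang fuse this into the GEMM, so this is only a reference, not their implementation.

```
import torch

FP8_MAX = 448.0  # largest magnitude representable in OCP float8 e4m3

def quant_activation_groups(x: torch.Tensor, group_size: int = 128):
    """Dynamically quantize activations of shape [tokens, hidden] in groups of 128."""
    tokens, hidden = x.shape
    assert hidden % group_size == 0
    groups = x.float().reshape(tokens, hidden // group_size, group_size)
    amax = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4)
    scale = amax / FP8_MAX                                # one scale per (token, group)
    q = (groups / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q.reshape(tokens, hidden), scale.squeeze(-1)   # fp8 tensor + per-group scales
```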
For an MxN weight we produce `ceil(M / BLOCK_SIZE_M) x ceil(N / BLOCK_SIZE_N)` blocks, computing one inverse scale per block that can later be persisted for the kernel. This significantly reduces the number of scale values that have to be stored alongside the weights.
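A minimal sketch of this block-wise weight quantization, assuming 128x128 blocks and the `weight_scale_inv` naming used by the DeepSeek-V3 FP8 checkpoints (illustrative, not the exact code in `bf16_cast_fp8.py`):

```
import torch

FP8_MAX = 448.0  # largest magnitude representable in OCP float8 e4m3
BLOCK = 128

def quant_weight_blockwise(w: torch.Tensor):
    """Quantize an (M, N) BF16 weight to FP8 with one dequant scale per 128x128 block."""
    M, N = w.shape
    mb, nb = -(-M // BLOCK), -(-N // BLOCK)            # ceil(M / 128), ceil(N / 128)
    padded = torch.zeros(mb * BLOCK, nb * BLOCK, dtype=torch.float32)
    padded[:M, :N] = w.float()
    blocks = padded.view(mb, BLOCK, nb, BLOCK)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-4)
    scale = amax / FP8_MAX                             # dequant scale, one per block
    q = (blocks / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    weight = q.view(mb * BLOCK, nb * BLOCK)[:M, :N].contiguous()
    weight_scale_inv = scale.view(mb, nb)              # dequant: w ≈ weight * scale, per block
    return weight, weight_scale_inv
```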
#### Step by step

###### 1. Perform quantization
```
python bf16_cast_fp8.py \
    --input-bf16-hf-path "${DIST_FS_ROOT}/DeepSeek-sft-bf16/" \
    --output-fp8-hf-path "${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate" \
    --input-fp8-hf-path "${DIST_FS_ROOT}/DeepSeek-V3-0324"
```
`--input-fp8-hf-path` is used to fetch the weight scale tensors from the original DeepSeek-V3 repo. The scale file can be generated by:
```
python bf16_cast_fp8.py \
    --input-fp8-hf-path "${DIST_FS_ROOT}/DeepSeek-V3-0324"
```
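For reference, a small sketch of how those per-block scale tensors can be located in the original FP8 checkpoint, assuming the `weight_scale_inv` tensor naming documented in README_WEIGHTS.md:

```
import json
import os

def find_scale_tensors(fp8_checkpoint_dir: str) -> dict:
    """Map each `*.weight_scale_inv` tensor to the safetensors shard that stores it."""
    index_path = os.path.join(fp8_checkpoint_dir, "model.safetensors.index.json")
    with open(index_path) as f:
        weight_map = json.load(f)["weight_map"]
    return {name: shard for name, shard in weight_map.items()
            if name.endswith("weight_scale_inv")}
```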
The script creates FP8 safetensors inside `${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate`, together with `${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate/model.safetensors.index.json`.
###### 2. Copy the following configs from your bf16 checkpoint to the folder
```
DEST=${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate

cp $BF16_CHECK_POINT/config.json ${DEST}/
cp $BF16_CHECK_POINT/configuration_deepseek.py ${DEST}/
cp $BF16_CHECK_POINT/modeling_deepseek.py ${DEST}/
cp $BF16_CHECK_POINT/tokenizer.json ${DEST}/
cp $BF16_CHECK_POINT/tokenizer_config.json ${DEST}/
```
Make sure the following dict is added into `config.json`:
```
"quantization_config": {
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [128, 128],
    "ignored_layers": [".*lm_head"]
}
```
We will add a simple class to automate this process later.
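Until that class exists, a minimal sketch of patching `config.json` in place (assuming `DEST` from step 2 is exported in the environment):

```
import json
import os

dest = os.environ["DEST"]  # e.g. ${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate
config_path = os.path.join(dest, "config.json")

with open(config_path) as f:
    config = json.load(f)

# Insert the quantization_config block from above.
config["quantization_config"] = {
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [128, 128],
    "ignored_layers": [".*lm_head"],
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```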
## BF16 upgrade training/inference
This was originally created for inference on non-FP8-capable chips. See details in [fp8_cast_bf16.py](./fp8_cast_bf16.py).
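Conceptually, the upcast is just the inverse of the block-wise quantization above. A sketch (not the exact implementation in `fp8_cast_bf16.py`) of the per-block dequantization:

```
import torch

BLOCK = 128

def dequant_weight_blockwise(weight: torch.Tensor, weight_scale_inv: torch.Tensor) -> torch.Tensor:
    """Upcast an FP8 (M, N) weight to BF16 using one dequant scale per 128x128 block."""
    M, N = weight.shape
    # Expand each per-block scale to element granularity, then multiply.
    scale = weight_scale_inv.repeat_interleave(BLOCK, dim=0)[:M]
    scale = scale.repeat_interleave(BLOCK, dim=1)[:, :N]
    return (weight.float() * scale.float()).to(torch.bfloat16)
```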