
DeepSeek-V3 Weight File Documentation

BF16 SFT to DeepSeek block-scale weight quantization for inference

The vLLM community project vllm-project/llm-compressor is working on integrating block-scale quantization support (PR#1475). vLLM and SGLang already support the DeepSeek FP8 dynamic (activation) quantization format.

Per README_WEIGHTS.md, DeepSeek weights need to be quantized statically with a block-wise max-reduction kernel. We found it handy to add BF16 quantization support on top of Hugging Face safetensors.

To produce an FP8 (OCP E4M3 format) quantized model, unlike Mixtral, we need to ignore lm_head while casting all 61 layers' MoE up, down, and gate projection weights to FP8:

from auto_fp8 import BaseQuantizeConfig  # config class bundled under auto_fp8/

quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="dynamic",
    ignore_patterns=[".*lm_head"],
)
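
For reference, here is a hedged sketch of how such a config is typically consumed with the upstream AutoFP8 API (the auto_fp8 code bundled in this repo may differ in details); the checkpoint paths are placeholders:

from auto_fp8 import AutoFP8ForCausalLM

# Placeholder paths, not real checkpoints.
model = AutoFP8ForCausalLM.from_pretrained("path/to/bf16-checkpoint", quantize_config)
model.quantize([])                 # the dynamic scheme needs no calibration samples
model.save_quantized("path/to/fp8-output")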

The SGLang and vLLM projects already take care of dynamic activation quantization, which for activations is essentially equivalent to 128-group (or 4 x E8M0 32-group) quantization. Block-scale quantization for weights, however, requires tiling the inputs and outputs along the non-K dimensions.
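
As an illustration only (not the vLLM/SGLang kernels themselves), per-token 128-group dynamic activation quantization boils down to one scale per 128 contiguous elements of the hidden dimension:

import torch

def act_quant_128(x: torch.Tensor, group: int = 128):
    """Quantize activations [..., H] (H divisible by 128) to FP8 E4M3, one scale per group."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max                   # 448.0
    g = x.float().reshape(*x.shape[:-1], x.shape[-1] // group, group)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / fp8_max
    q = (g / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn).reshape(x.shape)
    return q, scale.squeeze(-1)                                      # scales: [..., H // 128]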

For an M x N weight matrix, we produce ceil(M / BLOCK_SIZE_M) x ceil(N / BLOCK_SIZE_N) blocks, compute one scale per block, and persist the inverse scales for later use in the kernel. This significantly reduces the number of scale factors that need to be stored.
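
A minimal PyTorch sketch of this block-wise weight quantization (illustrative only; bf16_cast_fp8.py relies on the actual kernel in kernel.py):

import torch
import torch.nn.functional as F

def block_quant_fp8(weight: torch.Tensor, block: int = 128):
    """Quantize a BF16 [M, N] weight to FP8 E4M3 with one scale per 128x128 block."""
    M, N = weight.shape
    fp8_max = torch.finfo(torch.float8_e4m3fn).max            # 448.0 for E4M3
    pm, pn = -M % block, -N % block                           # pad up to block multiples
    w = F.pad(weight.float(), (0, pn, 0, pm))
    blocks = w.reshape(w.shape[0] // block, block, w.shape[1] // block, block)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = fp8_max / amax                                    # maps each block into FP8 range
    q = (blocks * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    q = q.reshape(w.shape)[:M, :N]                            # drop padding
    scale_inv = (1.0 / scale).squeeze(1).squeeze(-1)          # ceil(M/128) x ceil(N/128)
    return q, scale_inv                                       # stored as weight_scale_inv in the checkpoint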

Step by step

1. Perform quantization
python bf16_cast_fp8.py \
  --input-bf16-hf-path "${DIST_FS_ROOT}/DeepSeek-sft-bf16/" \
  --output-fp8-hf-path "${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate" \
  --input-fp8-hf-path "${DIST_FS_ROOT}/DeepSeek-V3-0324"

--input-fp8-hf-path is used to fetch the weight scale factors from the original DeepSeek-V3 repo. The corresponding scale file can be generated by running

python bf16_cast_fp8.py \
  --input-fp8-hf-path "${DIST_FS_ROOT}/DeepSeek-V3-0324"

The script writes the FP8 safetensors into ${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate together with ${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate/model.safetensors.index.json.

2. Copy the following configs from your BF16 checkpoint into the output folder
DEST=${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate

cp $BF16_CHECK_POINT/config.json ${DEST}/
cp $BF16_CHECK_POINT/configuration_deepseek.py ${DEST}/
cp $BF16_CHECK_POINT/modeling_deepseek.py ${DEST}/
cp $BF16_CHECK_POINT/tokenizer.json ${DEST}/
cp $BF16_CHECK_POINT/tokenizer_config.json ${DEST}/

Make sure the following dict has been added to config.json:

"quantization_config": {
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [128, 128],
    "ignored_layers": [".*lm_head"],
}

We will provide a simple class to automate this process later.
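
A hedged sketch of what that automation could look like (hypothetical helper, not part of the repo yet):

import json
import shutil
from pathlib import Path

def patch_checkpoint(bf16_ckpt: str, dest: str) -> None:
    """Copy config/tokenizer files from the BF16 checkpoint and inject quantization_config."""
    bf16_ckpt, dest = Path(bf16_ckpt), Path(dest)
    for name in ("config.json", "configuration_deepseek.py", "modeling_deepseek.py",
                 "tokenizer.json", "tokenizer_config.json"):
        shutil.copy(bf16_ckpt / name, dest / name)

    cfg_path = dest / "config.json"
    cfg = json.loads(cfg_path.read_text())
    cfg["quantization_config"] = {
        "activation_scheme": "dynamic",
        "fmt": "e4m3",
        "quant_method": "fp8",
        "weight_block_size": [128, 128],
        "ignored_layers": [".*lm_head"],
    }
    cfg_path.write_text(json.dumps(cfg, indent=2))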

BF16 upgrade training/inference

This was originally created for inference on chips without FP8 support. See fp8_cast_bf16.py for details.
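
For reference, block-wise dequantization back to BF16 amounts to broadcasting each stored inverse scale over its 128x128 block (a minimal sketch; the actual script relies on the dequantization kernel in kernel.py):

import torch

def block_dequant_bf16(q: torch.Tensor, scale_inv: torch.Tensor, block: int = 128):
    """Recover a BF16 [M, N] weight from FP8 E4M3 and its per-block inverse scales."""
    M, N = q.shape
    s = scale_inv.repeat_interleave(block, 0)[:M].repeat_interleave(block, 1)[:, :N]
    return (q.float() * s).to(torch.bfloat16)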