From 7813704f781d6489270053bde588c4ac80a2e051 Mon Sep 17 00:00:00 2001
From: yiakwy-xpu-ml-framework-team <961186938@qq.com>
Date: Tue, 1 Jul 2025 20:04:39 +0800
Subject: [PATCH] add readme

---
 inference/README.md | 73 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100644 inference/README.md

diff --git a/inference/README.md b/inference/README.md
new file mode 100644
index 0000000..005f8f7
--- /dev/null
+++ b/inference/README.md
@@ -0,0 +1,73 @@
# DeepSeek-V3 Weight File Documentation


## BF16 SFT to DeepSeek block-scale weight quantization for inference

The vLLM community project `vllm-project/llm-compressor` is working on integrating block-scale quantization ([#1475](https://github.com/vllm-project/llm-compressor/issues/1475)). vLLM and SGLang already support the DeepSeek FP8 dynamic (activation) quantization format.

The DeepSeek weight format ([README_WEIGHTS.md](../README_WEIGHTS.md)) requires weights to be quantized statically with a block-wise max-reduction kernel, and we found it handy to add BF16 quantization support on top of Hugging Face safetensors.

To produce an FP8 (OCP E4M3 format) quantized model, quite unlike Mixtral, we need to leave `lm_head` unquantized while casting the MoE up, down, and gate projection weights of all 61 layers to FP8:


```
quantize_config = BaseQuantizeConfig(
    quant_method="fp8",
    activation_scheme="dynamic",
    ignore_patterns=[".*lm_head"],
)
```

SGLang and vLLM already take care of dynamic activation quantization, which for activations is basically equivalent to 128-group (or 4 x E8M0 32-group) quantization. Block-scale quantization for weights, however, requires tiling along the non-K dimensions of the inputs and outputs.

For an MxN weight we produce `ceil(M / BLOCK_SIZE_M) x ceil(N / BLOCK_SIZE_N)` blocks together with their inverse scales, which can later be persisted and consumed by the kernel. This significantly reduces the number of scales that need to be stored. A minimal reference sketch of this block-scale quantization is included at the end of this README.

#### Step by step

###### 1. Perform quantization

```
python bf16_cast_fp8.py \
    --input-bf16-hf-path "${DIST_FS_ROOT}/DeepSeek-sft-bf16/" \
    --output-fp8-hf-path "${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate" \
    --input-fp8-hf-path "${DIST_FS_ROOT}/DeepSeek-V3-0324"
```

`--input-fp8-hf-path` is used to fetch the weight scales from the original DeepSeek-V3 repository. The file can be generated by:

```
python bf16_cast_fp8.py \
    --input-fp8-hf-path "${DIST_FS_ROOT}/DeepSeek-V3-0324"
```

The script creates the FP8 safetensors inside `${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate`, together with `${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate/model.safetensors.index.json`.

###### 2. Copy the following configs from your BF16 checkpoint to the output folder

```
DEST=${DIST_FS_ROOT}/DeepSeek-sft-bf16-FP8E4M3_block128x128-fp8-gate

cp $BF16_CHECK_POINT/config.json ${DEST}/
cp $BF16_CHECK_POINT/configuration_deepseek.py ${DEST}/
cp $BF16_CHECK_POINT/modeling_deepseek.py ${DEST}/
cp $BF16_CHECK_POINT/tokenizer.json ${DEST}/
cp $BF16_CHECK_POINT/tokenizer_config.json ${DEST}/
```

Make sure the following dict is added to `config.json`:

```
"quantization_config": {
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [128, 128],
    "ignored_layers": [".*lm_head"]
}
```

We will provide a simple class to automate this process later.

## BF16 upgrade for training/inference

This was originally created for inference on non-FP8-capable chips. See [fp8_cast_bf16.py](./fp8_cast_bf16.py) for details.
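
For reference, below is a minimal sketch of the 128x128 block-scale weight quantization described above (block-wise max reduction followed by an FP8 E4M3 cast). It is an illustration only and is not guaranteed to match `bf16_cast_fp8.py` line for line; the function name and the padding handling are assumptions.

```
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # OCP E4M3 max finite value (448.0)


def quantize_weight_block128(weight: torch.Tensor, block_size: int = 128):
    """Quantize a 2-D (M, N) BF16 weight to FP8 with one scale per 128x128 block.

    Returns the FP8 weight (M, N) and the inverse scales of shape
    (ceil(M / block_size), ceil(N / block_size)).
    """
    M, N = weight.shape
    # Pad so both dimensions are multiples of the block size.
    pad_m = (block_size - M % block_size) % block_size
    pad_n = (block_size - N % block_size) % block_size
    w = torch.nn.functional.pad(weight.float(), (0, pad_n, 0, pad_m))

    # View as (blocks_m, block, blocks_n, block) and take the per-block abs max
    # (the "block-wise max reduction").
    blocks = w.reshape(w.shape[0] // block_size, block_size,
                       w.shape[1] // block_size, block_size)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)

    scale = FP8_MAX / amax
    q = (blocks * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

    q = q.reshape(w.shape)[:M, :N]                       # drop the padding again
    scale_inv = (amax / FP8_MAX).squeeze(1).squeeze(-1)  # ceil(M/128) x ceil(N/128)
    return q, scale_inv
```

Storing the inverse scales rather than the scales mirrors the `weight_scale_inv` tensors described in [README_WEIGHTS.md](../README_WEIGHTS.md).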
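
For the reverse direction, here is a minimal sketch of block-scale dequantization (FP8 weight plus per-block inverse scales back to BF16), assuming the `weight_block_size = [128, 128]` layout above; the actual implementation lives in `fp8_cast_bf16.py` and may differ in details.

```
import torch


def dequantize_weight_block128(q: torch.Tensor, scale_inv: torch.Tensor,
                               block_size: int = 128) -> torch.Tensor:
    """Recover a BF16 weight from an FP8 weight and its per-block inverse scales.

    `q` has shape (M, N); `scale_inv` has shape
    (ceil(M / block_size), ceil(N / block_size)).
    """
    M, N = q.shape
    # Broadcast each per-block inverse scale over its 128x128 tile, then crop
    # back to (M, N) in case M or N is not a multiple of the block size.
    expanded = scale_inv.repeat_interleave(block_size, dim=0) \
                        .repeat_interleave(block_size, dim=1)[:M, :N]
    return (q.float() * expanded).to(torch.bfloat16)
```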