Updated the introductory sentence in the "Introduction" section to improve clarity and readability.
Changed:
"We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token."
To: "DeepSeek-V3 is a powerful Mixture-of-Experts (MoE) language model with 671 billion total parameters, of which 37 billion are activated per token."
This revision is more concise and places clearer emphasis on the key figures: total versus per-token activated parameters.
This file provides citation metadata for the DeepSeek-V3 project, including authors, DOI, and license, so that users can cite the work properly and the project receives consistent academic and professional attribution.
* handle missing scale_inv_name
Fixed an issue where a `weight` and its `weight_scale_inv` tensor (e.g. `model.layers.39.mlp.experts.92.gate_proj.weight` and `model.layers.39.mlp.experts.92.gate_proj.weight_scale_inv`) were not stored in the same SafeTensors shard, causing an assertion error because `scale_inv_name` was not present in the current `state_dict`.
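A minimal sketch of the kind of cross-shard lookup that avoids this assertion, assuming the checkpoint ships the usual `model.safetensors.index.json` with a `weight_map`; the names `make_tensor_loader`, `get_tensor`, and the shard cache are illustrative, not the script's actual identifiers:

```python
import json
import os

import torch
from safetensors.torch import load_file


def make_tensor_loader(ckpt_dir: str, device: str = "cuda"):
    """Return a get_tensor(name) helper that can fetch a tensor from any shard.

    Assumes a standard `model.safetensors.index.json` whose "weight_map"
    maps each tensor name to the shard file that contains it.
    """
    with open(os.path.join(ckpt_dir, "model.safetensors.index.json")) as f:
        weight_map = json.load(f)["weight_map"]

    loaded_shards: dict[str, dict[str, torch.Tensor]] = {}

    def get_tensor(name: str) -> torch.Tensor:
        # If a tensor such as a `weight_scale_inv` lives in a different shard
        # than the `weight` currently being processed, look its shard up in
        # the index and load that file, instead of asserting that the name is
        # already present in the current state_dict.
        shard = weight_map[name]
        if shard not in loaded_shards:
            loaded_shards[shard] = load_file(os.path.join(ckpt_dir, shard), device=device)
        return loaded_shards[shard][name]

    return get_tensor
```

With a fallback like this, the conversion loop can resolve a `weight_scale_inv` that landed in a neighboring shard instead of failing the assertion.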
* sort filenames to reduce memory costs
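A short, illustrative sketch of what this looks like in the conversion script; the directory path and variable names are assumptions. Iterating over the shard files in sorted order makes cross-shard tensor lookups more predictable, so shards that are no longer needed can be dropped sooner and peak memory stays lower:

```python
import os
from glob import glob

# Illustrative: enumerate the checkpoint shards in a deterministic, sorted
# order so related tensors (e.g. a weight and its scale_inv) are encountered
# close together and older shards can be released earlier.
ckpt_dir = "/path/to/DeepSeek-V3"  # assumed location of the FP8 checkpoint
safetensor_files = sorted(glob(os.path.join(ckpt_dir, "*.safetensors")))
for file_path in safetensor_files:
    ...  # convert the tensors stored in this shard
```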
* Add CUDA cache clearing in memory management
Added `torch.cuda.empty_cache()` to free up unused memory on the GPU, keeping memory usage in check while the checkpoint is converted shard by shard.
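A minimal, illustrative loop showing where such a call fits; only `torch.cuda.empty_cache()` reflects the change itself, while the function name and loading code are assumed scaffolding:

```python
import torch
from safetensors.torch import load_file


def convert_shards(safetensor_files: list[str]) -> None:
    """Illustrative conversion loop; only the empty_cache() call is the actual change."""
    for file_path in safetensor_files:
        state_dict = load_file(file_path, device="cuda")
        ...  # dequantize / convert the tensors in this shard
        # Drop this shard's tensors and return cached blocks to the GPU,
        # keeping peak memory bounded while iterating over many shards.
        del state_dict
        torch.cuda.empty_cache()
```

Note that `empty_cache()` only releases memory that is no longer referenced, so dropping references to finished shards (as above) is what makes the call effective.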