Implement a more robust Mixture of Experts (MoE) solution that handles
dynamic shapes in PyTorch. The implementation avoids `GuardOnDataDependentSymNode`
errors by:
- Using masked operations instead of data-dependent control flow
- Providing a cleaner alternative to error suppression
- Including a test file to verify both regular and compiled model behavior
The solution offers two approaches:
1. Quick fix via `torch._dynamo.config.suppress_errors`
2. Robust implementation using masked operations and proper weight handling (see the sketch after this list)
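A minimal sketch of both approaches follows, assuming a toy top-k router. The class name `MaskedMoE` and the layer sizes are illustrative, not the repository's actual code. The masked version trades extra compute (every expert processes every token) for shape-stable code that `torch.compile` can trace without data-dependent guards:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Approach 1 (quick fix): let Dynamo fall back to eager on guard failures.
# torch._dynamo.config.suppress_errors = True

class MaskedMoE(nn.Module):
    """Toy top-k MoE that routes with dense masks instead of data-dependent
    indexing (.nonzero(), boolean masks), avoiding GuardOnDataDependentSymNode."""

    def __init__(self, dim: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x)                                  # (tokens, n_experts)
        topk_vals, topk_ids = scores.topk(self.top_k, dim=-1)  # static output shapes
        weights = F.softmax(topk_vals, dim=-1)
        # Scatter the top-k weights into a dense (tokens, n_experts) routing matrix.
        routing = torch.zeros_like(scores).scatter(-1, topk_ids, weights)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            # Every expert runs on every token; the routing column zeroes out
            # tokens not assigned to this expert, so no tensor-dependent branching.
            out = out + routing[:, i : i + 1] * expert(x)
        return out

moe = torch.compile(MaskedMoE(dim=64, n_experts=4), dynamic=True)
y = moe(torch.randn(10, 64))  # token count can vary without data-dependent guards
```

Because every tensor shape in the forward pass is independent of the routing decisions, Dynamo never needs to guard on data-dependent values.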
This file provides detailed citation information for the DeepSeek-V3 project, including authors, DOI, license, and key project details, enabling users to cite the work properly and promoting accurate academic and professional attribution.
* Handle missing scale_inv_name
Fixed an issue where `weight` and `weight_scale_inv` (e.g. `model.layers.39.mlp.experts.92.gate_proj.weight` and `model.layers.39.mlp.experts.92.gate_proj.weight_scale_inv`) were not stored in the same SafeTensors shard, causing an assertion error because `scale_inv_name` was missing from the currently loaded state_dict. The sketch below illustrates the cross-shard lookup.
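A hedged sketch of this kind of fix, assuming a standard `model.safetensors.index.json` weight map; the helper names and caching scheme are illustrative, not the repository's actual code:

```python
import json
import os
from safetensors.torch import load_file

def make_loader(ckpt_dir: str):
    """Return a function that loads any tensor by name, using the index file
    to find which shard holds it instead of assuming the current shard."""
    with open(os.path.join(ckpt_dir, "model.safetensors.index.json")) as f:
        weight_map = json.load(f)["weight_map"]  # tensor name -> shard filename
    shard_cache: dict[str, dict] = {}

    def get_tensor(name: str):
        shard = weight_map[name]
        if shard not in shard_cache:
            shard_cache[shard] = load_file(os.path.join(ckpt_dir, shard))
        return shard_cache[shard][name]

    return get_tensor

# Usage: fetch a weight and its scale even when they live in different shards.
get_tensor = make_loader("/path/to/checkpoint")  # placeholder path
w = get_tensor("model.layers.39.mlp.experts.92.gate_proj.weight")
s = get_tensor("model.layers.39.mlp.experts.92.gate_proj.weight_scale_inv")
```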
* Sort filenames to reduce memory costs
* Add CUDA cache clearing in memory management
Added `torch.cuda.empty_cache()` to free up unused memory on the GPU; the sketch below shows this pattern together with the sorted-shard iteration.
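A minimal sketch combining the two memory-related fixes above. The paths and the cast are placeholders (a real FP8-to-BF16 conversion would also apply the inverse scales); the point is the iteration order and the per-shard cache clearing:

```python
import glob
import os
import torch
from safetensors.torch import load_file, save_file

src_dir, dst_dir = "/path/to/fp8", "/path/to/bf16"  # placeholder paths
os.makedirs(dst_dir, exist_ok=True)

# Process shards in sorted filename order so the working set stays predictable.
for shard_path in sorted(glob.glob(os.path.join(src_dir, "*.safetensors"))):
    state_dict = load_file(shard_path, device="cuda")
    # Illustrative cast only; real dequantization would use weight_scale_inv.
    converted = {name: t.to(torch.bfloat16) for name, t in state_dict.items()}
    save_file(converted, os.path.join(dst_dir, os.path.basename(shard_path)))
    del state_dict, converted
    # Release cached, no-longer-used blocks so freed memory does not
    # accumulate on the GPU across shards.
    torch.cuda.empty_cache()
```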