vThis update introduces parallel processing for token generation using torch.multiprocessing.Pool.
The new implementation improves inference speed by processing multiple sequences concurrently.
- Added the generate_parallel() function for parallel token generation.
- Used multiprocessing to distribute the workload across multiple processes, allowing for faster generation of tokens for multiple prompts.
- The generate_single_sequence() function was added to handle individual sequence generation logic, which is called by each worker in parallel.
- The num_workers parameter is introduced to control the number of worker processes (default is 4).
- Model is shared across processes for efficient memory usage.
These changes are particularly beneficial for batch processing or multi-prompt generation scenarios where multiple sequences need to be generated simultaneously.
This file includes detailed citation information for the DeepSeek-V3 project, such as authors, DOI, license, and key project details. It enables users to properly cite the work and promotes better academic and professional attribution.
* handle missing scale_inv_name
Fixed an issue where `weight` and `weight_scale_inv` (e.g. `model.layers.39.mlp.experts.92.gate_proj.weight` and `model.layers.39.mlp.experts.92.gate_proj.weight_scale_inv`) were not in the same SafeTensor, causing an assertion error due to scale_inv_name not being in the state_dict.
* sort filename to reduce memory costs
* Add CUDA cache clearing in memory management
Added torch.cuda.empty_cache() to free up unused memory on the GPU,