- Improved readability and structure of Triton kernels for FP8 weight dequantization and matrix multiplication (GEMM)
- Added comments for clarity
- Replaced hardcoded block sizes with configurable parameters
- Improved out-of-bounds safety by using tl.cdiv for block counts and masking loads and stores
- Renamed variables for consistent naming
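
For reviewers unfamiliar with the tl.cdiv and masking pattern, the idea can be sketched in plain Python. This is an illustration only, not the kernel code from this change: `cdiv`, `dequantize_blocked`, and the per-block `scales` layout are hypothetical stand-ins, and the inner loop emulates what a masked block load/store would do on the GPU.

```python
def cdiv(n: int, block: int) -> int:
    """Ceiling division, the same arithmetic tl.cdiv performs."""
    return (n + block - 1) // block


def dequantize_blocked(weights, scales, block_size):
    """Multiply each block of `weights` by its per-block scale.

    Hypothetical layout: scales[pid] belongs to block number `pid`.
    """
    n = len(weights)
    out = [0.0] * n
    # One "program" per block; cdiv makes sure a partial last block is covered.
    for pid in range(cdiv(n, block_size)):
        offsets = [pid * block_size + i for i in range(block_size)]
        # The mask is False for lanes that fall past the end of the data.
        mask = [off < n for off in offsets]
        for off, valid in zip(offsets, mask):
            if valid:  # masked-off lanes are neither read nor written
                out[off] = weights[off] * scales[pid]
    return out
```

With 10 elements and a block size of 4, `cdiv` yields 3 blocks, and the mask disables the last two lanes of the third block, so a size that is not a multiple of the block size is handled without reading or writing out of range.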
- Extracted the distributed setup logic into a separate init_distributed function
- Updated the sample function to draw tokens with torch.multinomial instead of the exponential approach
- Replaced the assert in main with a user-friendly check that at least one of input-file or interactive is provided
- Kept the interactive mode logic unchanged, but moved the init_distributed call to the beginning of main
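
The argument validation change could look roughly like the sketch below. It is an assumption-laden illustration rather than the code in this change: the `parse_args` helper and the exact flag spellings `--input-file` and `--interactive` are hypothetical, and `parser.error` is just one way to replace a bare assert with a readable message.

```python
import argparse


def parse_args(argv=None):
    """Parse CLI arguments and require --input-file or --interactive."""
    parser = argparse.ArgumentParser()
    # Flag names are assumed here for illustration.
    parser.add_argument("--input-file", type=str, default="")
    parser.add_argument("--interactive", action="store_true")
    args = parser.parse_args(argv)
    # Unlike assert, parser.error prints the usage text plus this message
    # and exits with status 2; it also still runs under `python -O`.
    if not (args.input_file or args.interactive):
        parser.error("provide --input-file or --interactive")
    return args
```

Calling `parse_args(["--interactive"])` succeeds, while `parse_args([])` prints a usage message and exits instead of raising an AssertionError with a traceback.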