From 30b7c65fb600cd0c3b842b456cc713f31c5a428e Mon Sep 17 00:00:00 2001
From: Afueth Thomas <97304915+Afueth@users.noreply.github.com>
Date: Mon, 27 Jan 2025 15:16:42 +0530
Subject: [PATCH] Update README.md

Updated the capitalization of the word "recommended" to "Recommended" in the
inference section headings (6.2-6.5) to ensure consistency with title case
formatting throughout the document. This change aligns these headings with the
rest of the README for a more polished and professional appearance.
---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 7ecf87e..9b6496a 100644
--- a/README.md
+++ b/README.md
@@ -304,7 +304,7 @@ Or batch inference on a given file:
 torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --input-file $FILE
 ```
 
-### 6.2 Inference with SGLang (recommended)
+### 6.2 Inference with SGLang (Recommended)
 
 [SGLang](https://github.com/sgl-project/sglang) currently supports [MLA optimizations](https://lmsys.org/blog/2024-09-04-sglang-v0-3/#deepseek-multi-head-latent-attention-mla-throughput-optimizations), [DP Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models), FP8 (W8A8), FP8 KV Cache, and Torch Compile, delivering state-of-the-art latency and throughput performance among open-source frameworks.
 
@@ -316,18 +316,18 @@ Multi-Token Prediction (MTP) is in development, and progress can be tracked in t
 Here are the launch instructions from the SGLang team: https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3
 
-### 6.3 Inference with LMDeploy (recommended)
+### 6.3 Inference with LMDeploy (Recommended)
 
 [LMDeploy](https://github.com/InternLM/lmdeploy), a flexible and high-performance inference and serving framework tailored for large language models, now supports DeepSeek-V3. It offers both offline pipeline processing and online deployment capabilities, seamlessly integrating with PyTorch-based workflows.
 
 For comprehensive step-by-step instructions on running DeepSeek-V3 with LMDeploy, please refer to here: https://github.com/InternLM/lmdeploy/issues/2960
 
-### 6.4 Inference with TRT-LLM (recommended)
+### 6.4 Inference with TRT-LLM (Recommended)
 
 [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) now supports the DeepSeek-V3 model, offering precision options such as BF16 and INT4/INT8 weight-only. Support for FP8 is currently in progress and will be released soon. You can access the custom branch of TRTLLM specifically for DeepSeek-V3 support through the following link to experience the new features directly: https://github.com/NVIDIA/TensorRT-LLM/tree/deepseek/examples/deepseek_v3.
 
-### 6.5 Inference with vLLM (recommended)
+### 6.5 Inference with vLLM (Recommended)
 
 [vLLM](https://github.com/vllm-project/vllm) v0.6.6 supports DeepSeek-V3 inference for FP8 and BF16 modes on both NVIDIA and AMD GPUs. Aside from standard techniques, vLLM offers _pipeline parallelism_ allowing you to run this model on multiple machines connected by networks. For detailed guidance, please refer to the [vLLM instructions](https://docs.vllm.ai/en/latest/serving/distributed_serving.html). Please feel free to follow [the enhancement plan](https://github.com/vllm-project/vllm/issues/11539) as well.
 