doc: followup #89 add client demo

Yineng Zhang 2024-09-23 17:17:18 +08:00 committed by GitHub
parent 73e9dfc91b
commit 5f357ddb29


@@ -295,20 +295,37 @@ Assistant:
```

### Inference with SGLang (recommended)
[SGLang](https://github.com/sgl-project/sglang) currently supports MLA, FP8 (W8A8), FP8 KV Cache, CUDA Graph, and Torch Compile, offering the best performance among open source frameworks. Here are some example commands to launch an OpenAI API-compatible server:
```bash
# BF16, tensor parallelism = 8
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code
# BF16, torch.compile
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --enable-torch-compile
# FP8, tensor parallelism = 8, FP8 KV cache
python3 -m sglang.launch_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --trust-remote-code --kv-cache-dtype fp8_e5m2
```
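Before sending real requests, one way to confirm the server is reachable is to list the served models. This is a minimal sketch, assuming the server exposes the standard OpenAI `/v1/models` route and is running on SGLang's default port 30000; the API key is only a placeholder.

```python
import openai

# Point the client at the locally launched SGLang server (default port 30000).
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# List the models the server is serving; the launched checkpoint should appear here.
for model in client.models.list():
    print(model.id)
```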
After launching the server, you can query it with the OpenAI API:

```python
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)
```
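The same client can also stream tokens as they are generated, which is convenient for longer completions. This is a minimal sketch assuming the same server as above and that it supports streamed responses (`stream=True`); the prompt is only an example.

```python
import openai

# Reuse the OpenAI-compatible endpoint from the demo above.
client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Streaming chat completion: print tokens as they arrive.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
    temperature=0,
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```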
### Inference with vLLM (recommended)