From 5f357ddb29c8fcf9439e381c1c7568ff79b51625 Mon Sep 17 00:00:00 2001
From: Yineng Zhang
Date: Mon, 23 Sep 2024 17:17:18 +0800
Subject: [PATCH] doc: followup #89 add client demo

---
 README.md | 33 +++++++++++++++++++++++++--------
 1 file changed, 25 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index ed6ff1d..00892c5 100644
--- a/README.md
+++ b/README.md
@@ -295,20 +295,37 @@ Assistant:
 ```
 
 ### Inference with SGLang (recommended)
-[SGLang](https://github.com/sgl-project/sglang) currently supports MLA, FP8 (W8A8), FP8 KV Cache, CUDA Graph, and Torch Compile, offering the best performance among open source frameworks. Here are some examples of commands:
+[SGLang](https://github.com/sgl-project/sglang) currently supports MLA, FP8 (W8A8), FP8 KV Cache, CUDA Graph, and Torch Compile, offering the best performance among open-source frameworks. Here are some example commands to launch an OpenAI API-compatible server:
 
 ```bash
-# fp16 tp8
+# BF16, tensor parallelism = 8
 python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code
 
-# fp16 tp8 w/ torch compile
-python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code --enable-torch-compile
+# BF16, torch.compile
+python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --trust-remote-code --enable-torch-compile
 
-# fp16 tp8 w/ torch compile, max torch compile batch size 1
-python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --tp 8 --trust-remote-code --enable-torch-compile --max-torch-compile-bs 1
+# FP8, tensor parallelism = 8, FP8 KV cache
+python3 -m sglang.launch_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --trust-remote-code --kv-cache-dtype fp8_e5m2
+```
 
-# fp8 tp8 w/ torch compile, fp8 e5m2 kv cache
-python3 -m sglang.launch_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --trust-remote-code --enable-torch-compile --kv-cache-dtype fp8_e5m2
+After launching the server, you can query it with the OpenAI API:
+
+```python
+import openai
+client = openai.Client(
+    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
+
+# Chat completion
+response = client.chat.completions.create(
+    model="default",
+    messages=[
+        {"role": "system", "content": "You are a helpful AI assistant"},
+        {"role": "user", "content": "List 3 countries and their capitals."},
+    ],
+    temperature=0,
+    max_tokens=64,
+)
+print(response)
 ```
 
 ### Inference with vLLM (recommended)
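
To try the change locally, the mailbox-format patch above can be applied with standard git tooling; the filename below is hypothetical (use whatever name the patch was saved under):

```bash
# Apply the mailbox-format patch (filename is hypothetical) and
# confirm it touches only README.md with the expected stat line.
git am doc-followup-89-add-client-demo.patch
git show --stat HEAD
```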
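The new client demo can also be smoke-tested without the `openai` Python package by sending the same request with plain `curl` — a minimal sketch, assuming the server launched by one of the commands in the diff is listening on port 30000, the port used in the demo's `base_url`:

```bash
# Same chat-completion request as the Python demo, sent to the
# OpenAI-compatible endpoint exposed by sglang.launch_server.
curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "default",
        "messages": [
          {"role": "system", "content": "You are a helpful AI assistant"},
          {"role": "user", "content": "List 3 countries and their capitals."}
        ],
        "temperature": 0,
        "max_tokens": 64
      }'
```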