Model Download | Evaluation Results | Quick Start | License | Citation
1. Introduction
2. Evaluation Results
3. Model Downloads
We release the DeepSeek MoE 16B, including both base and chat models, to the public, in order to support a broader and more diverse range of research within both academic and commercial communities. Please note that the use of this model is subject to the terms outlined in the License section. Commercial usage is permitted under these terms.
Huggingface
Model | Sequence Length | Download |
---|---|---|
DeepSeek MoE 16B Base | 4096 | 🤗 HuggingFace |
DeepSeek MoE 16B Chat | 4096 | 🤗 HuggingFace |
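Both checkpoints can be fetched directly from the Hugging Face Hub. As a minimal sketch, the huggingface_hub library can download the weights programmatically; the local directory name below is a hypothetical example:

from huggingface_hub import snapshot_download

# Download the DeepSeek MoE 16B Base weights from the Hugging Face Hub.
# "./deepseek-moe-16b-base" is a hypothetical local directory of your choice.
snapshot_download(
    repo_id="deepseek-ai/deepseek-moe-16b-base",
    local_dir="./deepseek-moe-16b-base",
)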
4. Quick Start
Installation
With a Python >= 3.8 environment, install the necessary dependencies by running the following command:
pip install -r requirements.txt
Inference with Huggingface's Transformers
You can directly employ Huggingface's Transformers for model inference.
Text Completion
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "deepseek-ai/deepseek-ai/deepseek-moe-16b-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id
text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
Chat Completion
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "deepseek-ai/deepseek-moe-16b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id
messages = [
{"role": "user", "content": "Who are you?"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)
result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
If you prefer not to use the provided apply_chat_template function, you can also interact with our model following the sample template below. Note that messages should be replaced by your input.
User: {messages[0]['content']}
Assistant: {messages[1]['content']}<|end▁of▁sentence|>User: {messages[2]['content']}
Assistant:
Note: By default (add_special_tokens=True), our tokenizer automatically adds a bos_token (<|begin▁of▁sentence|>) before the input text. Additionally, since the system prompt is not compatible with this version of our models, we DO NOT RECOMMEND including the system prompt in your input.
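As a rough illustration, the template above can also be applied by hand in Python; the exact whitespace between turns is an assumption here, so prefer apply_chat_template when in doubt:

# Minimal sketch of building the chat prompt manually, reusing the tokenizer and
# model objects from the Chat Completion example above. The exact spacing of the
# template is an assumption; apply_chat_template remains the reference.
messages = [
    {"role": "user", "content": "Who are you?"}
]
prompt = ""
for message in messages:
    if message["role"] == "user":
        prompt += f"User: {message['content']}\n\nAssistant:"
    else:
        prompt += f" {message['content']}<|end▁of▁sentence|>"

inputs = tokenizer(prompt, return_tensors="pt")  # add_special_tokens=True also prepends the bos_token
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))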
How to Fine-tune DeepSeek-MoE
We provide the script finetune/finetune.py
for users to finetune our models on downstream tasks.
The script supports training with DeepSpeed. You need to install the required packages by running:
pip install -r requirements.txt
Please follow the Sample Dataset Format to prepare your training data.
Each item has two required fields: instruction and output.
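As an illustrative sketch, a tiny dataset file could be written as follows; only the two field names come from the format above, while the file name and the single-JSON-list layout are assumptions:

import json

# Hypothetical two-example dataset; "instruction" and "output" are the required fields.
samples = [
    {"instruction": "Who are you?", "output": "I am an AI assistant."},
    {"instruction": "Explain what a mixture-of-experts model is.",
     "output": "A mixture-of-experts model routes each token to a small subset of expert sub-networks."},
]

with open("train_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)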
After data preparation, you can use the sample shell script to finetune the DeepSeek-MoE model.
Remember to specify DATA_PATH and OUTPUT_PATH, and choose appropriate hyper-parameters (e.g., learning_rate, per_device_train_batch_size) according to your scenario.
DATA_PATH="<your_data_path>"
OUTPUT_PATH="<your_output_path>"
MODEL_PATH="<your_model_path>"
cd finetune
deepspeed finetune.py \
--model_name_or_path $MODEL_PATH \
--data_path $DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 3 \
--model_max_length 1024 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 100 \
--save_total_limit 100 \
--learning_rate 2e-5 \
--warmup_steps 10 \
--logging_steps 1 \
--lr_scheduler_type "cosine" \
--gradient_checkpointing True \
--report_to "tensorboard" \
--deepspeed configs/ds_config_zero3.json \
--bf16 True \
--use_lora False
You can also finetune the model with 4/8-bit QLoRA; feel free to try it.
DATA_PATH="<your_data_path>"
OUTPUT_PATH="<your_output_path>"
MODEL_PATH="<your_model_path>"
cd finetune
deepspeed finetune.py \
--model_name_or_path $MODEL_PATH \
--data_path $DATA_PATH \
--output_dir $OUTPUT_PATH \
--num_train_epochs 3 \
--model_max_length 1024 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 100 \
--save_total_limit 100 \
--learning_rate 2e-5 \
--warmup_steps 10 \
--logging_steps 1 \
--lr_scheduler_type "cosine" \
--gradient_checkpointing True \
--report_to "tensorboard" \
--deepspeed configs/ds_config_zero2_no_offload.json \
--bf16 True \
--use_lora True \
--bits 4 \
--max_grad_norm 0.3 \
--double_quant \
--lora_r 64 \
--lora_alpha 16 \
--quant_type nf4
5. License
This code repository is licensed under the MIT License. The use of DeepSeek models is subject to the Model License. DeepSeek supports commercial use.
See the LICENSE-CODE and LICENSE-MODEL for more details.
6. Citation
7. Contact
If you have any questions, please raise an issue or contact us at service@deepseek.com.