DeepSeek-Coder/README_zh.md
2023-10-29 14:34:57 +08:00

<p align="center">
<img width="1000px" alt="DeepSeek Coder" src="pictures/logo.jpeg">
</p>
<p align="center"><a href="https://www.deepseek.com/">[<img src="pictures/home.png" width="30px">Homepage]</a> | <a href="https://coder.deepseek.com/">[🤖 Try Online]</a> | <a href="https://huggingface.co/deepseek-ai">[🤗 Model Download]</a> | <a href="README.md">[📄 English Version]</a> </p>
<hr>
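### 1. Introduction to DeepSeek Coder
DeepSeek Coder is a series of code language models pre-trained on 2T tokens, of which 87% is code and 13% is natural language in English and Chinese.
The models come in several sizes, from 1B to 33B parameters. Each model is pre-trained on project-level code data with a 16K window size and an additional fill-in-the-blank task, which enables project-level code completion and infilling. In terms of coding capability, DeepSeek Coder achieves state-of-the-art performance among open-source code models across multiple programming languages and various benchmarks.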
<img src="pictures/result.png" alt="result" width="85%">
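- **Massive Training Data**: Trained on 2T tokens, comprising 87% code and 13% natural language data in English and Chinese.
- **Flexible and Scalable**: Available in 1B, 7B, and 33B sizes, so users can choose the model that best fits their needs.
- **Strong Model Performance**: State-of-the-art among publicly available code models on the HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks.
- **Project-Level Code Completion**: A 16K window size and a fill-in-the-blank training task support project-level code completion and infilling.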
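### 2. Data Processing and Model Training
#### Data Processing
- Step 1: Collect code data from GitHub and filter it with the same rules as [StarcoderData](https://github.com/bigcode-project/bigcode-dataset).
- Step 2: Parse the dependencies between files in the same repository and rearrange the file positions according to those dependencies.
- Step 3: Concatenate dependent files into a single example and deduplicate with a repository-level MinHash algorithm.
- Step 4: Further filter out low-quality code, such as code with syntax errors or poor readability.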
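
Step 2 can be pictured as a topological sort over a repository's import graph. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the helper `order_by_dependencies` and the toy dependency map are hypothetical, and extracting the dependency edges from source files is assumed to have happened already.

```python
from collections import deque
from typing import Dict, List, Set


def order_by_dependencies(deps: Dict[str, Set[str]]) -> List[str]:
    """Order files so that every file comes after the files it depends on.

    `deps` maps a file name to the set of files it imports (hypothetical
    input; parsing imports out of the source code is not shown here).
    """
    # Collect every file mentioned either as a key or as a dependency.
    files = set(deps)
    for targets in deps.values():
        files.update(targets)

    # remaining[f] = number of dependencies of f that were not emitted yet.
    remaining = {f: len(deps.get(f, set())) for f in files}
    # dependents[d] = files that import d.
    dependents: Dict[str, List[str]] = {f: [] for f in files}
    for f, targets in deps.items():
        for d in targets:
            dependents[d].append(f)

    # Kahn's algorithm; sorting keeps the output deterministic.
    ready = deque(sorted(f for f in files if remaining[f] == 0))
    ordered: List[str] = []
    while ready:
        f = ready.popleft()
        ordered.append(f)
        for g in sorted(dependents[f]):
            remaining[g] -= 1
            if remaining[g] == 0:
                ready.append(g)

    if len(ordered) != len(files):
        raise ValueError("cyclic dependency; a real pipeline needs a fallback order")
    return ordered


# Toy repository: main.py imports both other files.
repo = {"main.py": {"utils.py", "model.py"}, "model.py": set(), "utils.py": set()}
print(order_by_dependencies(repo))
```

With this ordering, a file's dependencies always precede it when the files are later concatenated into one training example.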
<img src="pictures/data_clean.png" alt="data_creation" width="100%">
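
The repository-level MinHash deduplication in step 3 can be illustrated with a small pure-Python sketch. Everything here is an assumption made for illustration: the shingle size, the number of hash functions, the similarity threshold, and the helper names are not taken from the DeepSeek Coder pipeline, and a production system would typically add locality-sensitive hashing instead of comparing all pairs.

```python
import hashlib
from typing import List, Set

NUM_PERM = 64  # number of seeded hash functions (illustrative choice)


def shingles(text: str, k: int = 5) -> Set[str]:
    """Split a text into overlapping k-token shingles."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(text: str, num_perm: int = NUM_PERM) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a: List[int], b: List[int]) -> float:
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def deduplicate(examples: List[str], threshold: float = 0.85) -> List[str]:
    """Keep an example only if it is not a near-duplicate of one already kept."""
    kept: List[str] = []
    kept_signatures: List[List[int]] = []
    for text in examples:
        signature = minhash_signature(text)
        if all(estimated_jaccard(signature, other) < threshold for other in kept_signatures):
            kept.append(text)
            kept_signatures.append(signature)
    return kept


# Each string stands for one repository, already concatenated into a single example.
repo_a = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"
repo_b = "class Stack:\n    def __init__(self):\n        self.items = []"
print(len(deduplicate([repo_a, repo_a, repo_b])))
```

Working on the concatenated example rather than on individual files means that two repositories are compared as wholes, which is what "repository-level" refers to in the step above.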
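#### Model Training
- Step 1: Pre-train on the processed data, which consists of 87% code, 10% code-related language data (GitHub Markdown and Stack Exchange), and 3% non-code-related Chinese language data. In this step, the models are pre-trained on 1.8T tokens with a 4K window size.
- Step 2: Extend the window to 16K and continue pre-training on an additional 200B tokens, resulting in the base models (**DeepSeek-Coder-Base**).
- Step 3: Fine-tune on 2B tokens of instruction data, resulting in the instruction-tuned models (**DeepSeek-Coder-Instruct**).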
<img src="pictures/model_pretraining.png" alt="model_pretraining" width="100%">
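
The token budget in the steps above can be checked with a few lines of arithmetic. This is only a sketch: the constant and function names are made up for illustration, and the per-source split simply applies the stated step-1 percentages to the step-1 token count.

```python
# Token counts as stated in the training steps.
STAGE1_TOKENS = 1.8e12    # step 1: 4K-window pre-training
STAGE2_TOKENS = 0.2e12    # step 2: 16K-window pre-training
INSTRUCT_TOKENS = 2e9     # step 3: instruction fine-tuning

# Data mix reported for step 1.
MIX = {
    "code": 0.87,
    "code-related language": 0.10,
    "non-code Chinese": 0.03,
}


def tokens_per_source(total: float, mix: dict) -> dict:
    """Split a token budget across data sources according to their shares."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1")
    return {name: share * total for name, share in mix.items()}


pretrain_total = STAGE1_TOKENS + STAGE2_TOKENS
print(f"pre-training total: {pretrain_total / 1e12:.1f}T tokens")
for name, count in tokens_per_source(STAGE1_TOKENS, MIX).items():
    print(f"{name}: {count / 1e9:.0f}B tokens")
```

The two pre-training stages add up to the 2T tokens quoted in the introduction, and the step-1 mix works out to roughly 1,566B code tokens, 180B code-related language tokens, and 54B Chinese tokens.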
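### 3. Download and Environment Dependencies
We provide a PyTorch-compatible version based on Hai-LLM that supports transformers (4.34+), so the models can be used on other GPU platforms.
The model weights have also been uploaded to [huggingface](https://huggingface.co/deepseek-ai).
#### Environment Dependencies
Python 3.8+ / CUDA 11+ / PyTorch 2.0+ / transformers 4.34+.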
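
A quick way to verify the Python-side requirements is to compare installed package versions against the minimums. The sketch below uses only the standard library; `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helpers, the version parsing is deliberately simplistic, and the CUDA requirement is not checked.

```python
import sys
from importlib import metadata  # available from Python 3.8
from typing import Dict, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.1.0+cu118' into (2, 1, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in version.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding them to the same length with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums: Dict[str, str]) -> Dict[str, str]:
    """Return a human-readable status for each required package."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[package] = f"{installed} ({'ok' if ok else 'needs >= ' + minimum})"
    return report


print("Python ok:", sys.version_info >= (3, 8))
print(check_environment({"torch": "2.0"}))
```

The same call can be extended with a `"transformers"` entry set to the minimum listed above.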
### 4. Model Inference
Refer to the following examples to use our models:
#### 1) Code Completion
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and the base model, then move the model to the GPU.
tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True).cuda()

input_text = "#write a quick sort algorithm"
# The tokenizer returns a BatchEncoding, which has no .cuda(); move its tensors with .to().
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
This code will output the following result:
```
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
    for i in range(1, len(arr)):
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)
```
#### 2) Code Infilling
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True).cuda()

# The code before the gap follows <fim_prefix>; the code after the gap follows <fim_middle>.
input_text = """<fim_prefix>def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[0]
    left = []
    right = []
<fim_middle>
        if arr[i] < pivot:
            left.append(arr[i])
        else:
            right.append(arr[i])
    return quick_sort(left) + [pivot] + quick_sort(right)<fim_suffix>"""
# The tokenizer returns a BatchEncoding, which has no .cuda(); move its tensors with .to().
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
# Drop the echoed prompt and print only the generated infill.
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])
```
This code will output the following result:
```
for i in range(1, len(arr)):
```
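
The infilling prompt above can be assembled programmatically from the code before and after the gap. This is a minimal sketch: the marker strings are copied from the example above and may differ from the special tokens of a released tokenizer, and `build_infill_prompt` and `extract_infill` are hypothetical helper names.

```python
# Marker strings as they appear in the example above (assumed, not verified
# against any released tokenizer).
FIM_PREFIX = "<fim_prefix>"
FIM_MIDDLE = "<fim_middle>"
FIM_SUFFIX = "<fim_suffix>"


def build_infill_prompt(before: str, after: str) -> str:
    """Wrap the code before and after the gap in the layout used by the example."""
    return f"{FIM_PREFIX}{before}{FIM_MIDDLE}{after}{FIM_SUFFIX}"


def extract_infill(decoded: str, prompt: str) -> str:
    """Remove the echoed prompt from the decoded output, if it is present."""
    return decoded[len(prompt):] if decoded.startswith(prompt) else decoded


before = "def add(a, b):\n"
after = "\n    return result"
prompt = build_infill_prompt(before, after)
print(prompt)
```

Checking `startswith` before slicing avoids cutting into the generated text when the decoded string does not reproduce the prompt exactly, for instance when special tokens are skipped during decoding.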
#### 3) Repository-Level Code Completion
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True).cuda()
# The prompt concatenates several files; each file starts with a "#<filename>" comment.
input_text = """#utils.py
import torch
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
def load_data():
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    # Standardize the data
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    # Convert numpy data to PyTorch tensors
    X_train = torch.tensor(X_train, dtype=torch.float32)
    X_test = torch.tensor(X_test, dtype=torch.float32)
    y_train = torch.tensor(y_train, dtype=torch.int64)
    y_test = torch.tensor(y_test, dtype=torch.int64)
    return X_train, X_test, y_train, y_test
def evaluate_predictions(y_test, y_pred):
    return accuracy_score(y_test, y_pred)
#model.py
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
class IrisClassifier(nn.Module):
    def __init__(self):
        super(IrisClassifier, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 16),
            nn.ReLU(),
            nn.Linear(16, 3)
        )
    def forward(self, x):
        return self.fc(x)
    def train_model(self, X_train, y_train, epochs, lr, batch_size):
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(self.parameters(), lr=lr)
        # Create DataLoader for batches
        dataset = TensorDataset(X_train, y_train)
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        for epoch in range(epochs):
            for batch_X, batch_y in dataloader:
                optimizer.zero_grad()
                outputs = self(batch_X)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()
    def predict(self, X_test):
        with torch.no_grad():
            outputs = self(X_test)
        _, predicted = outputs.max(1)
        return predicted.numpy()
#main.py
from utils import load_data, evaluate_predictions
from model import IrisClassifier as Classifier
def main():
    # Model training and evaluation
"""
# The tokenizer returns a BatchEncoding, which has no .cuda(); move its tensors with .to().
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0]))
```
---
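In the following example, the DeepSeek-Coder 7B model correctly calls the `IrisClassifier` class and its member functions from `model.py`, and uses the functions from `utils.py`, to complete the model training and evaluation code in `main.py`.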
![Completion GIF](pictures/completion_demo.gif)
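
A repository-level prompt like the one above can be built by concatenating files, each introduced by a `#<filename>` comment, and ending with the unfinished target file. The helper below is a hypothetical sketch of that layout; it assumes the caller already lists the context files with dependencies first.

```python
from typing import List, Tuple


def build_repo_prompt(
    context_files: List[Tuple[str, str]],
    target_name: str,
    target_prefix: str,
) -> str:
    """Concatenate context files and the beginning of the file to complete.

    Every file is preceded by a "#<filename>" line, mirroring the example
    prompt above. `context_files` is a list of (file name, source) pairs.
    """
    sections = [f"#{name}\n{source.rstrip()}\n" for name, source in context_files]
    sections.append(f"#{target_name}\n{target_prefix}")
    return "".join(sections)


prompt = build_repo_prompt(
    [("utils.py", "def load_data():\n    return [1, 2, 3]\n")],
    "main.py",
    "from utils import load_data\ndef main():\n",
)
print(prompt)
```

Ending the prompt inside the target file lets the model continue that file while the earlier files serve as context.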
#### 4) Chat
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True).cuda()
prompt = "write a quick sort algorithm in python."
# Wrap the request in the instruction template.
prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\n\n### Instruction:\nWrite a program to perform the given task.\n\nInput:\n{prompt}\n\n### Response:\n"""
# Call the tokenizer directly (not tokenizer.encode) so the result can be unpacked with ** below.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0]))
```
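
The instruction template from the chat example can be factored into a small helper, together with a function that separates the answer from the echoed prompt. This is a sketch under stated assumptions: the template text is copied from the example above, while `build_chat_prompt` and `extract_response` are hypothetical names.

```python
# Template text copied from the chat example above.
INSTRUCTION_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context.\n"
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWrite a program to perform the given task.\n\n"
    "Input:\n{request}\n\n"
    "### Response:\n"
)
RESPONSE_MARKER = "### Response:\n"


def build_chat_prompt(request: str) -> str:
    """Insert the user request into the instruction template."""
    return INSTRUCTION_TEMPLATE.format(request=request)


def extract_response(decoded: str) -> str:
    """Return the text after the first response marker, or everything if it is absent."""
    _, marker, tail = decoded.partition(RESPONSE_MARKER)
    return tail.strip() if marker else decoded.strip()


prompt = build_chat_prompt("write a quick sort algorithm in python.")
print(prompt)
```

Since `tokenizer.decode(outputs[0])` returns the prompt followed by the generated text, passing that string to `extract_response` leaves only the model's answer.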
### 5. Evaluation Results
The code to reproduce the following evaluation results is available in the [Evaluation](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation) directory.
#### 1) [HumanEval](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation/HumanEval)
##### Multilingual Base Models
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg |
| ------------------- | ---- | ------ | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| code-cushman-001 | 12B | 33.5% | 31.9% | 30.6% | 28.9% | 31.3% | 22.1% | 11.7% | - | - |
| CodeShell | 7B | 35.4% | 32.9% | 34.2% | 31.7% | 30.2% | 38.0% | 7.0% | 33.5% | 30.4% |
| CodeGeeX2 | 6B | 36.0% | 29.2% | 25.9% | 23.6% | 20.8% | 29.7% | 6.3% | 24.8% | 24.5% |
| StarCoderBase | 16B | 31.7% | 31.1% | 28.5% | 25.4% | 34.0% | 34.8% | 8.9% | 29.8% | 28.0% |
| CodeLLama (7B) | 7B | 31.7% | 29.8% | 34.2% | 23.6% | 36.5% | 36.7% | 12.0% | 29.2% | 29.2% |
| CodeLLama (13B) | 13B | 36.0% | 37.9% | 38.0% | 34.2% | 45.2% | 43.0% | 16.5% | 32.3% | 35.4% |
| CodeLLama (34B) | 34B | 48.2% | 44.7% | 44.9% | 41.0% | 42.1% | 48.7% | 15.8% | 42.2% | 41.0% |
| | | | | | | | | | | |
| OraCoder-Base (1B) | 1B | 34.8% | 31.1% | 32.3% | 24.2% | 28.9% | 36.7% | 10.1% | 28.6% | 28.3% |
| OraCoder-Base (7B) | 7B | 49.4% | 50.3% | 43.0% | 38.5% | 49.7% | 50.0% | 28.5% | 48.4% | 44.7% |
| OraCoder-Base (33B) | 33B | - | - | - | - | - | - | - | - | - |
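
For the rows checked here, the Avg column is consistent with an unweighted mean of the eight per-language scores. That reading is inferred from the numbers, not stated in the text, and the sketch below only tests it on two rows copied from the table above.

```python
from statistics import mean

# (per-language scores in table order, reported Avg), copied from the table above.
ROWS = {
    "CodeShell": ([35.4, 32.9, 34.2, 31.7, 30.2, 38.0, 7.0, 33.5], 30.4),
    "OraCoder-Base (7B)": ([49.4, 50.3, 43.0, 38.5, 49.7, 50.0, 28.5, 48.4], 44.7),
}


def average_matches(scores, reported, tolerance=0.05):
    """Check whether the reported average equals the mean up to rounding to one decimal."""
    return abs(mean(scores) - reported) <= tolerance + 1e-9


for model_name, (scores, reported) in ROWS.items():
    print(model_name, f"{mean(scores):.2f}", reported, average_matches(scores, reported))
```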
##### Instruction-Tuned Models
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg |
| ------------------- | ---- | ------ | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| ChatGPT | - | 70.7% | 50.3% | 54.5% | 52.2% | 62.3% | 64.6% | 34.8% | 60.9% | 52.2% |
| GPT-4 | - | 82.3% | 70.2% | 74.8% | 70.8% | 73.0% | 77.9% | 51.3% | 83.2% | 72.9% |
| WizardCoder | 16B | 51.8% | 41.6% | 41.1% | 42.2% | 44.7% | 46.8% | 12.7% | 42.8% | 40.5% |
| Phind-CodeLlama | 34B | - | - | - | - | - | - | - | - | - |
| | | | | | | | | | | |
| OraCoder-Chat (1B) | 1B | - | - | - | - | - | - | - | - | - |
| OraCoder-Chat (7B) | 7B | - | - | - | - | - | - | - | - | - |
| OraCoder-Chat (33B) | 33B | - | - | - | - | - | - | - | - | - |
#### 2) [Math Reasoning](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation/PAL-Math)
##### Multilingual Base Models
| Model | Size | GSM8k | MATH | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg |
| -------------- | ---- | ----- | ----- | -------- | ----- | ------ | ----- | ----- | ----- |
| CodeShell | 7B | 17.0% | 9.1% | 18.2% | 45.6% | 29.6% | 46.6% | 56.8% | 31.8% |
| CodeGeex-2 | 7B | 23.6% | 9.6% | 22.4% | 48.0% | 47.2% | 46.9% | 66.0% | 37.7% |
| StarCoder-Base | 16B | 27.3% | 11.5% | 24.2% | 44.0% | 45.6% | 54.9% | 73.4% | 40.1% |
| CodeLLama-Base | 7B | 36.4% | 12.3% | 29.7% | 57.6% | 58.4% | 59.6% | 82.6% | 48.0% |
| CodeLLama-Base | 13B | 44.2% | 15.5% | 42.4% | 65.6% | 61.6% | 65.3% | 85.3% | 54.3% |
| CodeLLama-Base | 34B | 58.2% | 22.1% | 55.2% | 77.2% | 69.6% | 70.0% | 92.8% | 63.6% |
| | | | | | | | | | |
| OraCoder-Base | 1B | 17.0% | 13.4% | 13.3% | 39.2% | 42.4% | 44.8% | 66.0% | 33.7% |
| OraCoder-Base | 7B | 46.0% | 20.6% | 40.0% | 67.2% | 71.2% | 67.1% | 89.1% | 57.3% |
| OraCoder-Base | 33B | - | - | - | - | - | - | - | - |
##### Instruction-Tuned Models
| Model | Size | GSM8k | MATH | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg |
| ------------- | ---- | ----- | ----- | -------- | ----- | ------ | ----- | ----- | ----- |
| ChatGPT | - | 78.6% | 38.7% | 67.6% | 77.8% | 79.9% | 81.0% | 89.4% | 73.3% |
| GPT-4 | - | 94.2% | 51.8% | 77.6% | 94.8% | 95.9% | 92.6% | 97.7% | 86.4% |
| | | | | | | | | | |
| OraCoder-Chat | 1B | - | - | - | - | - | - | - | - |
| OraCoder-Chat | 7B | - | - | - | - | - | - | - | - |
| OraCoder-Chat | 33B | - | - | - | - | - | - | - | - |
### 6. License
### 7. Contact
If you have any questions, please raise an issue or contact us at [agi_code@deepseek.com](mailto:agi_code@deepseek.com).