Update README_zh.md

Daya Guo 2023-10-29 14:34:57 +08:00 committed by GitHub
parent 3b494868cc
commit 23c57b5459


@ -1,22 +1,24 @@
<p align="center"> <p align="center">
<img width="1000px" alt="DeepSeek Coder" src="pictures/logo.jpeg"> <img width="1000px" alt="DeepSeek Coder" src="pictures/logo.jpeg">
</p> </p>
<p align="center"><a href="">[🏠 Homepage]</a> | <a href="">[🤖 Try Online]</a> | <a href="">[🤗 Model Download]</a> | <a href="">[📄 English Version]</a> </p> <p align="center"><a href="https://www.deepseek.com/">[<img src="pictures/home.png" width="30px">Homepage]</a> | <a href="https://coder.deepseek.com/">[🤖 Try Online]</a> | <a href="https://huggingface.co/deepseek-ai">[🤗 Model Download]</a> | <a href="README.md">[📄 English Version] </a> </p>
<hr> <hr>
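### 1. Introduction to Deepseek Coder ### 1. Introduction to Deepseek Coder
Deepseek Coder comprises a series of advanced language models pretrained on 2T tokens, of which 87% is code and 13% is natural language in Chinese and English. Deepseek Coder comprises a series of pretrained code models, trained on 2T tokens, of which 87% is code and 13% is natural language in Chinese and English.
Deepseek Coder offers code models in a range of parameter sizes, from 1B to 33B. Each model is pretrained on project-level code data with a 16K window size and an additional Fill-in-the-blank task, to support project-level code completion and infilling. Deepseek Coder offers code models in a range of parameter sizes, from 1B to 33B. Each model is pretrained on project-level code data with a 16K window size and an additional Fill-in-the-blank task, to support project-level code completion and infilling. In terms of coding capability, Deepseek Coder achieves the best performance among current open-source code models across multiple programming languages and a variety of benchmarks.
In terms of coding capability, Deepseek Coder achieves the best performance among current open-source code models across multiple programming languages and a variety of benchmarks.
- **Large amount of training data**: trained on 2T tokens, comprising 87% code and 13% natural language in English and Chinese. <img src="pictures/result.png" alt="result" width="85%">
- **Highly flexible and scalable**: available in 1B, 7B, and 33B sizes, so users can choose the model that best fits their needs. - **Massive training data**: trained on 2 trillion tokens, comprising 87% code and 13% natural language in English and Chinese.
- **Superior model performance**: on the HumanEval-X, MultiPL-E, MBPP, DS-1000, and APPS benchmarks, DeepSeek Coder performs best among publicly available code models. - **Flexible and scalable**: available in 1B, 7B, and 33B sizes, so users can choose the model that best fits their needs.
- **Advanced code completion**: a 16K window size and a Fill-in-the-blank training task support project-level code completion and infilling. - **Strong model performance**: on the HumanEval, MultiPL-E, MBPP, DS-1000, and APPS benchmarks, DeepSeek Coder performs best among publicly available code models.
- **Project-level code completion**: a 16K window size and a Fill-in-the-blank training task support project-level code completion and infilling.
### 2. Data Processing and Model Training ### 2. Data Processing and Model Training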
@ -24,27 +26,27 @@ Deepseek Coder提供各种参数大小的代码模型范围从1B到33B版本
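- Step 1: Collect code data from GitHub and filter it with the same rules as [StarcoderData](https://github.com/bigcode-project/bigcode-dataset). - Step 1: Collect code data from GitHub and filter it with the same rules as [StarcoderData](https://github.com/bigcode-project/bigcode-dataset).
- Step 2: Parse the dependencies among files in the same repository and reorder the files according to those dependencies. - Step 2: Parse the dependencies among files in the same repository and reorder the files according to those dependencies.
- Step 3: Concatenate dependent files into a single example and deduplicate with repo-level minhash. - Step 3: Concatenate dependent files into a single example and deduplicate with the repo-level minhash algorithm.
- Step 4: Further filter out low-quality code, such as code with syntax errors or poor readability. - Step 4: Further filter out low-quality code, such as code with syntax errors or poor readability.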
![Data Clean Procedure](pictures/data_clean.png) <img src="pictures/data_clean.png" alt="data_creation" width="100%">
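Steps 2 and 3 above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used to build the dataset: the function names (`order_by_dependency`, `minhash_signature`, `estimated_jaccard`), the input format, the whitespace tokenizer, the 5-token shingles, and the 64 hash seeds are all assumptions chosen for the example.

```python
import hashlib
from collections import deque


def order_by_dependency(deps):
    """Return files so that each file comes after the files it depends on.

    `deps` maps a file name to the list of files it imports
    (a hypothetical input format for this sketch).
    """
    files = set(deps) | {d for ds in deps.values() for d in ds}
    indegree = {f: 0 for f in files}
    dependents = {f: [] for f in files}
    for f, ds in deps.items():
        for d in set(ds):
            indegree[f] += 1
            dependents[d].append(f)
    # Kahn's algorithm; sorting keeps the output deterministic.
    queue = deque(sorted(f for f in files if indegree[f] == 0))
    ordered = []
    while queue:
        f = queue.popleft()
        ordered.append(f)
        for g in sorted(dependents[f]):
            indegree[g] -= 1
            if indegree[g] == 0:
                queue.append(g)
    # Files caught in an import cycle are appended at the end in name order.
    ordered.extend(sorted(files - set(ordered)))
    return ordered


def minhash_signature(text, num_hashes=64, k=5):
    """Build a MinHash signature from k-token shingles of `text`."""
    tokens = text.split()
    shingles = {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}
    signature = []
    for seed in range(num_hashes):
        # One seeded hash per signature slot; keep the smallest value seen.
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


deps = {"main.py": ["utils.py", "model.py"], "model.py": ["utils.py"], "utils.py": []}
print(order_by_dependency(deps))
```

In a deduplication pass, two concatenated repository examples whose estimated similarity exceeds some threshold would be treated as near-duplicates; the threshold is not given in the text above and would have to be chosen separately.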
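#### Model Training #### Model Training
- Step 1: First pretrain on the processed data, which consists of 87% code, 10% code-related language data (Github Markdown and Stack Exchange), and 3% Chinese language data unrelated to code. In this step, the model is pretrained with 1.8T tokens and a 4K window size. - Step 1: First pretrain on the processed data, which consists of 87% code, 10% code-related language data (Github Markdown and Stack Exchange), and 3% Chinese language data unrelated to code. In this step, the model is pretrained on 1.8 trillion tokens with a 4K window size.
- Step 2: Extend the window to 16K and continue pretraining on an additional 200B tokens, yielding the base models (DeepSeek-Coder-Base). - Step 2: Extend the window to 16K and continue pretraining on an additional 200 billion tokens, yielding the base models (**DeepSeek-Coder-Base**).
- Step 3: Fine-tune on 300M tokens of instruction data, yielding the instruction-tuned models (DeepSeek-Coder-Instruct). - Step 3: Fine-tune on 2 billion tokens of instruction data, yielding the instruction-tuned models (**DeepSeek-Coder-Instruct**).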
![Model Pre-training](pictures/model_pretraining.png) <img src="pictures/model_pretraining.png" alt="model_pretraining" width="100%">
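The token budget in the steps above can be checked with simple arithmetic. The sketch below assumes that the 87% / 10% / 3% mix applies uniformly to the 1.8T tokens of step 1; the variable names are illustrative only.

```python
TRILLION = 10**12
BILLION = 10**9

stage1_tokens = 18 * TRILLION // 10   # 1.8T tokens at a 4K window (step 1)
stage2_tokens = 200 * BILLION         # additional 200B tokens at a 16K window (step 2)

# Data mix stated for step 1, in percent.
stage1_mix_percent = {"code": 87, "code_related_language": 10, "chinese_language": 3}

stage1_breakdown = {
    name: stage1_tokens * pct // 100 for name, pct in stage1_mix_percent.items()
}
pretraining_total = stage1_tokens + stage2_tokens

for name, count in stage1_breakdown.items():
    print(f"{name}: {count // BILLION}B tokens")
print(f"total pretraining tokens: {pretraining_total / TRILLION:.1f}T")
```

The two pretraining stages together account for the 2T tokens quoted in the introduction.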
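### 3. Download and Requirements ### 3. Download and Requirements
Deepseek Coder was originally implemented in Pytorch and trained on A100 GPUs. We provide a pytorch-compatible version based on Hai-LLM that supports transformers(3.34+), so that it can be used on other GPU platforms. We provide a pytorch-compatible version based on Hai-LLM that supports transformers(3.34+), so that it can be used on other GPU platforms.
The model weights have also been uploaded to 🤗 [huggingface](https://huggingface.co/deepseek-ai/deepseek-coder-7b). The model weights have also been uploaded to [huggingface](https://huggingface.co/deepseek-ai).
#### Requirements #### Requirements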
Python 3.8+ / CUDA 11+ / PyTorch 2.0+ / transformers 3.34+. Python 3.8+ / CUDA 11+ / PyTorch 2.0+ / transformers 3.34+.
@ -52,33 +54,72 @@ Python 3.8+ / CUDA 11+ / PyTorch 2.0+ / transformers 3.34+.
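### 4. Model Inference ### 4. Model Inference
Refer to the examples below to use our models: Refer to the examples below to use our models:
#### Code Completion #### 1) Code Completion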
```python ```python
from transformers import AutoTokenizer, AutoModelForCausalLM from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b") import torch
device = 0 if torch.cuda.is_available() else -1 tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek/deepseek-coder-7b").to(device) model = AutoModelForCausalLM.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True).cuda()
inputs = tokenizer("def hello_world():", return_tensors="pt").to(device) input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128) outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) print(tokenizer.decode(outputs[0], skip_special_tokens=True))
``` ```
#### Code Infilling This code will output the following result:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM ```
tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b") def quick_sort(arr):
device = 0 if torch.cuda.is_available() else -1 if len(arr) <= 1:
model = AutoModelForCausalLM.from_pretrained("deepseek/deepseek-coder-7b").to(device) return arr
input_text = "<fim_prefix>def print_hello_world():\n <fim_suffix>\n print('Hello world!')<fim_middle>" pivot = arr[0]
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device) left = []
outputs = model.generate(**inputs, max_length=128) right = []
print(tokenizer.decode(outputs[0])) for i in range(1, len(arr)):
if arr[i] < pivot:
left.append(arr[i])
else:
right.append(arr[i])
return quick_sort(left) + [pivot] + quick_sort(right)
``` ```
#### Repository-Level Code Completion #### 2) Code Infilling
```python ```python
from transformers import AutoTokenizer, AutoModelForCausalLM from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b") import torch
tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True).cuda()
input_text = """<fim_prefix>def quick_sort(arr):
if len(arr) <= 1:
return arr
pivot = arr[0]
left = []
right = []
<fim_middle>
if arr[i] < pivot:
left.append(arr[i])
else:
right.append(arr[i])
return quick_sort(left) + [pivot] + quick_sort(right)<fim_suffix>"""
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(input_text):])
```
This code will output the following result:
```
for i in range(1, len(arr)):
```
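The infilling prompt above follows a fixed layout: the code before the gap, a marker at the gap, the code after the gap, and a closing marker. A small helper makes that layout explicit. The helper name `build_fim_prompt` is hypothetical, and the sentinel strings are copied from the example above; the tokens a released tokenizer actually expects may differ, so treat them as assumptions.

```python
# Sentinel strings as they appear in the example above (assumed, not verified
# against a released tokenizer).
FIM_PREFIX = "<fim_prefix>"
FIM_MIDDLE = "<fim_middle>"
FIM_SUFFIX = "<fim_suffix>"


def build_fim_prompt(prefix, suffix):
    """Wrap the code before and after the gap in the layout used by the example."""
    return f"{FIM_PREFIX}{prefix}{FIM_MIDDLE}{suffix}{FIM_SUFFIX}"


prompt = build_fim_prompt(
    "def quick_sort(arr):\n    if len(arr) <= 1:\n        return arr\n",
    "\n    return quick_sort(left) + [pivot] + quick_sort(right)",
)
print(prompt)
```

Keeping the prefix and suffix as separate arguments avoids hand-editing one long string when the gap moves.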
#### 3) Repository-Level Code Completion
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True).cuda()
input_text = """#utils.py input_text = """#utils.py
import torch import torch
from sklearn import datasets from sklearn import datasets
@ -153,32 +194,101 @@ from model import IrisClassifier as Classifier
def main(): def main():
# Model training and evaluation # Model training and evaluation
""" """
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device) inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128) outputs = model.generate(**inputs, max_new_tokens=140)
print(tokenizer.decode(outputs[0])) print(tokenizer.decode(outputs[0]))
``` ```
--- ---
In the following scenario, the Deepseek-Coder 7B model effectively calls a class **IrisClassifier** and its member function from the `model.py` file, and also utilizes functions from the `utils.py` file, to correctly complete the **main** function in the `main.py` file for model training and evaluation.
In the example below, the Deepseek-Coder 7B model calls the `IrisClassifier` class and its member functions from `model.py` and uses functions from `utils.py` to correctly complete the model training and evaluation logic in `main.py`. In the example below, the Deepseek-Coder 7B model calls the `IrisClassifier` class and its member functions from `model.py` and uses functions from `utils.py` to correctly complete the model training and evaluation logic in `main.py`.
![Completion GIF](pictures/completion_demo.gif) ![Completion GIF](pictures/completion_demo.gif)
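The repository-level prompt in this section is a concatenation of files, each introduced by a `#filename` comment, ending with the unfinished file the model should continue. A minimal sketch of assembling such a prompt follows; the helper name `build_repo_prompt` and the blank line between files are assumptions, not something the README specifies.

```python
def build_repo_prompt(files, target_name, target_prefix):
    """Join (name, source) pairs under `#name` headers, then open the target file.

    `files` should already be in dependency order, with dependencies first.
    """
    parts = [f"#{name}\n{source.rstrip()}\n" for name, source in files]
    # The last part is left unfinished so the model continues from it.
    parts.append(f"#{target_name}\n{target_prefix}")
    return "\n".join(parts)


prompt = build_repo_prompt(
    [("utils.py", "import torch\n"), ("model.py", "import torch.nn as nn\n")],
    "main.py",
    "from model import IrisClassifier as Classifier\n\ndef main():\n",
)
print(prompt)
```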
#### Chat Function #### 4) Conversation Function
```python ```python
from transformers import AutoTokenizer, AutoModelForCausalLM from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b") tokenizer = AutoTokenizer.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("deepseek/deepseek-coder-7b-base", trust_remote_code=True).cuda()
prompt = "write a quick sort algorithm in python." prompt = "write a quick sort algorithm in python."
prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\n\n### Instruction:\nWrite a program to perform the given task.\n\nInput:\n{prompt}\n\n### Response:\n""" prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context.\nWrite a response that appropriately completes the request.\n\n### Instruction:\nWrite a program to perform the given task.\n\nInput:\n{prompt}\n\n### Response:\n"""
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device) inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128) outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0])) print(tokenizer.decode(outputs[0]))
``` ```
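The instruction prompt in the example above is a fixed template with one slot for the user's request. Wrapping it in a function keeps the template in one place. The name `build_instruction_prompt` is hypothetical, and the template text is copied from the example rather than from any official specification.

```python
# Template text copied from the chat example above.
INSTRUCTION_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context.\n"
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nWrite a program to perform the given task.\n\n"
    "Input:\n{task}\n\n"
    "### Response:\n"
)


def build_instruction_prompt(task):
    """Insert the user's request into the instruction template."""
    return INSTRUCTION_TEMPLATE.format(task=task.strip())


print(build_instruction_prompt("write a quick sort algorithm in python."))
```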
### 5. Lincense
### 6. Citation
### 5. Evaluation Results
Code to reproduce the evaluation results below is available in the [Evaluation](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation) directory.
#### 1) [HumanEval](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation/HumanEval)
##### Multilingual Base Models
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg |
| ------------------- | ---- | ------ | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| code-cushman-001 | 12B | 33.5% | 31.9% | 30.6% | 28.9% | 31.3% | 22.1% | 11.7% | - | - |
| CodeShell | 7B | 35.4% | 32.9% | 34.2% | 31.7% | 30.2% | 38.0% | 7.0% | 33.5% | 30.4% |
| CodeGeeX2 | 6B | 36.0% | 29.2% | 25.9% | 23.6% | 20.8% | 29.7% | 6.3% | 24.8% | 24.5% |
| StarCoderBase | 16B | 31.7% | 31.1% | 28.5% | 25.4% | 34.0% | 34.8% | 8.9% | 29.8% | 28.0% |
| CodeLLama (7B) | 7B | 31.7% | 29.8% | 34.2% | 23.6% | 36.5% | 36.7% | 12.0% | 29.2% | 29.2% |
| CodeLLama (13B) | 13B | 36.0% | 37.9% | 38.0% | 34.2% | 45.2% | 43.0% | 16.5% | 32.3% | 35.4% |
| CodeLLama (34B) | 34B | 48.2% | 44.7% | 44.9% | 41.0% | 42.1% | 48.7% | 15.8% | 42.2% | 41.0% |
| | | | | | | | | | | |
| OraCoder-Base (1B) | 1B | 34.8% | 31.1% | 32.3% | 24.2% | 28.9% | 36.7% | 10.1% | 28.6% | 28.3% |
| OraCoder-Base (7B) | 7B | 49.4% | 50.3% | 43.0% | 38.5% | 49.7% | 50.0% | 28.5% | 48.4% | 44.7% |
| OraCoder-Base (33B) | 33B | - | - | - | - | - | - | - | - | - |
##### Instruction-Tuned Models
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg |
| ------------------- | ---- | ------ | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| ChatGPT | - | 70.7% | 50.3% | 54.5% | 52.2% | 62.3% | 64.6% | 34.8% | 60.9% | 52.2% |
| GPT-4 | - | 82.3% | 70.2% | 74.8% | 70.8% | 73.0% | 77.9% | 51.3% | 83.2% | 72.9% |
| WizardCoder | 16B | 51.8% | 41.6% | 41.1% | 42.2% | 44.7% | 46.8% | 12.7% | 42.8% | 40.5% |
| Phind-CodeLlama | 34B | - | - | - | - | - | - | - | - | - |
| | | | | | | | | | | |
| OraCoder-Chat (1B) | 1B | - | - | - | - | - | - | - | - | - |
| OraCoder-Chat (7B) | 7B | - | - | - | - | - | - | - | - | - |
| OraCoder-Chat (33B) | 33B | - | - | - | - | - | - | - | - | - |
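
The Avg column in the HumanEval tables appears to be the unweighted mean of the eight language scores. The sketch below checks that assumption for three rows of the instruction-tuned table, using the values exactly as listed; `row_mean` is a helper name introduced here.

```python
def row_mean(scores):
    """Unweighted mean of the per-language scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


# (per-language scores, Avg as listed in the table)
rows = {
    "ChatGPT": ([70.7, 50.3, 54.5, 52.2, 62.3, 64.6, 34.8, 60.9], 52.2),
    "GPT-4": ([82.3, 70.2, 74.8, 70.8, 73.0, 77.9, 51.3, 83.2], 72.9),
    "WizardCoder": ([51.8, 41.6, 41.1, 42.2, 44.7, 46.8, 12.7, 42.8], 40.5),
}

for model, (scores, listed) in rows.items():
    computed = row_mean(scores)
    status = "matches" if computed == listed else "differs"
    print(f"{model}: computed {computed}, listed {listed} ({status})")
```

Under this assumption the GPT-4 and WizardCoder rows agree with their listed Avg, while the ChatGPT scores average to 56.3 rather than the listed 52.2, which may indicate a slip in either the Avg cell or one of the scores.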
#### 2) [Math Reasoning](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation/PAL-Math)
##### Multilingual Base Models
| Model | Size | GSM8k | MATH | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg |
| -------------- | ---- | ----- | ----- | -------- | ----- | ------ | ----- | ----- | ----- |
| CodeShell | 7B | 17.0% | 9.1% | 18.2% | 45.6% | 29.6% | 46.6% | 56.8% | 31.8% |
| CodeGeex-2 | 7B | 23.6% | 9.6% | 22.4% | 48.0% | 47.2% | 46.9% | 66.0% | 37.7% |
| StarCoder-Base | 16B | 27.3% | 11.5% | 24.2% | 44.0% | 45.6% | 54.9% | 73.4% | 40.1% |
| CodeLLama-Base | 7B | 36.4% | 12.3% | 29.7% | 57.6% | 58.4% | 59.6% | 82.6% | 48.0% |
| CodeLLama-Base | 13B | 44.2% | 15.5% | 42.4% | 65.6% | 61.6% | 65.3% | 85.3% | 54.3% |
| CodeLLama-Base | 34B | 58.2% | 22.1% | 55.2% | 77.2% | 69.6% | 70.0% | 92.8% | 63.6% |
| | | | | | | | | | |
| OraCoder-Base | 1B | 17.0% | 13.4% | 13.3% | 39.2% | 42.4% | 44.8% | 66.0% | 33.7% |
| OraCoder-Base | 7B | 46.0% | 20.6% | 40.0% | 67.2% | 71.2% | 67.1% | 89.1% | 57.3% |
| OraCoder-Base | 33B | - | - | - | - | - | - | - | - |
##### Instruction-Tuned Models
| Model | Size | GSM8k | MATH | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg |
| ------------- | ---- | ----- | ----- | -------- | ----- | ------ | ----- | ----- | ----- |
| ChatGPT | - | 78.6% | 38.7% | 67.6% | 77.8% | 79.9% | 81.0% | 89.4% | 73.3% |
| GPT-4 | - | 94.2% | 51.8% | 77.6% | 94.8% | 95.9% | 92.6% | 97.7% | 86.4% |
| | | | | | | | | | |
| OraCoder-Chat | 1B | - | - | - | - | - | - | - | - |
| OraCoder-Chat | 7B | - | - | - | - | - | - | - | - |
| OraCoder-Chat | 33B | - | - | - | - | - | - | - | - |
### 6. License
### 7. Contact
If you have any questions, please raise an issue or contact us at [agi_code@deepseek.com](mailto:agi_code@deepseek.com).