update evaluation

This commit is contained in:
Dejian 2023-11-01 15:04:54 +08:00
parent aaa5c54e6c
commit 4d854b8955
4 changed files with 20 additions and 11 deletions

View File

@@ -1,6 +1,6 @@
## 1. Introduction
We provide a test script to evaluate the performance of the **deepseek-coder** model on code generation benchmarks. We select the widely-used benchmarks: **[HumanEval](https://huggingface.co/datasets/openai_humaneval), [MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E)**.
We provide a test script to evaluate the performance of the **deepseek-coder** model on code generation benchmarks. We select the widely-used benchmarks: **[HumanEval-Python](https://huggingface.co/datasets/openai_humaneval), [HumanEval-Multilingual](https://huggingface.co/datasets/nuprl/MultiPL-E)**.
@@ -14,7 +14,6 @@ pip install pytorch
```
## 3. Evaluation
We've created a sample script, **eval.sh**, that demonstrates how to test the **deepseek-coder-1b-python** model on the HumanEval dataset leveraging **8** GPUs. If your use case involves a different model or dataset, simply adjust the script to fit your needs.
@@ -35,7 +34,6 @@ python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py
We report experimental results here for 8 mainstream programming languages: **Python**, **C++**, **Java**, **PHP**, **TypeScript**, **C#**, **Bash**, and **JavaScript**. For all open-source models, we use this repository to obtain their performance on the HumanEval dataset. We set the maximum input length to **4096** and the maximum output length to **500**, and employ the **greedy search strategy**.
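For context, the snippet below is a minimal sketch of this decoding setup (inputs truncated to 4096 tokens, up to 500 generated tokens, greedy search) using the Hugging Face `transformers` API; the checkpoint id and prompt are illustrative placeholders, not the exact configuration used by `eval_pal.py`.
```python
# Minimal sketch of the decoding setup described above; the checkpoint id and
# prompt are illustrative, not the exact evaluation configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/deepseek-coder-1.3b-base"  # illustrative checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda()

prompt = 'def fib(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt",
                   truncation=True, max_length=4096).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=False)  # greedy search
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(completion)
```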
#### (1) Multilingual Base Models
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg |
@@ -55,13 +53,12 @@ We report experimental results here for 8 main-stream programming languages, **p
#### (3) Instruction-Tuned Models
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg |
|---------------------|------|--------|-------|------|------|------|------|------|------|------|
| ChatGPT | - | 70.7% | 50.3% | 54.5%| 52.2%| 62.3%| 64.6%| 34.8%| 60.9%| 52.2%|
| GPT-4 | - | 82.3% | 70.2% | 74.8%| 70.8%| 73.0%| 77.9%| 51.3%| 83.2%| 72.9%|
| GPT35-turbo | - | 76.2% | 63.4% | 69.2%| 60.9%| 69.1%| 70.8%| 42.4%| 67.1%| 64.9%|
| GPT-4 | - | 84.1% | 76.4% | 81.6%| 77.2%| 77.4%| 79.1%| 58.2%| 78.0%| 76.5%|
| WizardCoder | 16B | 51.8% | 41.6% | 41.1%| 42.2%| 44.7%| 46.8%| 12.7%| 42.8%| 40.5%|
| Phind-CodeLlama | 34B | - | - | - | - | - | - | - | - | - |
| | | | | | | | | | | |
| OraCoder-Chat (1B) | 1B | - | - | - | - | - | - | - | - | - |
| OraCoder-Chat (7B) | 7B | - | - | - | - | - | - | - | - | - |
| OraCoder-Chat (33B) | 33B | - | - | - | - | - | - | - | - | - |
| OraCoder-Chat (7B) | 7B | 78.9% | 63.4% | 68.4% | 68.9%| 67.2%| 72.8%| 36.7%| 72.7%| 66.1%|
| OraCoder-Chat (33B) | 33B | 79.3% | 68.9% | 73.4% | 72.7%| 67.9%| 74.1%| 43.0%| 73.9%| 69.2%|

View File

@@ -9,7 +9,9 @@
Deepseek Coder comprises a series of code language models trained on 87% code and 13% natural language in both English and Chinese, with each model pre-trained on 2T tokens. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task to support project-level code completion and infilling. For coding capabilities, Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.
<img src="pictures/result.png" alt="result" width="85%">
<p align="center">
<img src="pictures/result.png" alt="result" width="80%">
</p>
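Because the extra fill-in-the-blank objective is what enables infilling, a minimal sketch of how an infilling prompt can be assembled is shown below; the sentinel token strings are hypothetical placeholders, and the actual special tokens are defined by the released tokenizer.
```python
# Hypothetical fill-in-the-middle (FIM) prompt assembly; the sentinel strings
# below are placeholders, not the model's actual special tokens.
PREFIX_TOK, HOLE_TOK, SUFFIX_TOK = "<fim_begin>", "<fim_hole>", "<fim_end>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Ask the model to generate the code that belongs between prefix and suffix."""
    return f"{PREFIX_TOK}{prefix}{HOLE_TOK}{suffix}{SUFFIX_TOK}"

prompt = build_fim_prompt(
    prefix="def quick_sort(arr):\n    if len(arr) <= 1:\n        return arr\n",
    suffix="\n    return quick_sort(left) + [pivot] + quick_sort(right)\n",
)
```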
- **Massive Training Data**: Trained on 2T tokens, including 87% code and 13% natural language data in both English and Chinese.
@@ -29,7 +31,8 @@ Deepseek Coder comprises a series of code language models trained on both 87% co
- Step 2: Parsing the dependencies of files within the same repository to rearrange the file positions based on their dependencies.
- Step 3: Concatenating dependent files to form a single example and employing repo-level MinHash for deduplication (a minimal deduplication sketch follows the pipeline figure below).
- Step 4: Further filtering out low-quality code, such as code with syntax errors or poor readability.
- <img src="pictures/data_clean.png" alt="data_creation" width="100%">
<img src="pictures/data_clean.png" alt="data_creation" width="100%">
#### Model Training
@@ -203,9 +206,18 @@ print(tokenizer.decode(outputs[0]))
```
### 5. Evaluation Results
We evaluate DeepSeek Coder on various coding-related benchmarks.
The `pass@1` results on HumanEval (Python and Multilingual), MBPP, and DS-1000 are reported as follows:
The reproducible code for the following evaluation results can be found in the [Evaluation](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation) directory.
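For context, `pass@1` refers to the standard unbiased pass@k estimator introduced with the HumanEval benchmark, evaluated at k=1; with greedy decoding and a single sample per problem it reduces to the fraction of problems solved. A compact reference implementation is sketched below, and it is not necessarily the exact code used in the Evaluation directory.
```python
# Unbiased pass@k estimator; with one greedy sample per problem, pass@1 is
# simply the fraction of problems whose completion passes all unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples per problem, c: correct samples, k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=10, k=1))  # ≈ 0.05
```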
<p align="center">
<img src="pictures/table.png" alt="table" width="85%">
</p>
The results show that DeepSeek-Coder-Base-33B significantly outperforms existing open-source code LLMs. Compared with CodeLlama-34B, it leads by 7.9%, 9.3%, 10.8%, and 5.9% on HumanEval Python, HumanEval Multilingual, MBPP, and DS-1000, respectively.
Surprisingly, our DeepSeek-Coder-Base-7B reaches the performance of CodeLlama-34B.
After instruction tuning, the DeepSeek-Coder-Instruct-33B model outperforms GPT35-turbo on HumanEval and achieves results comparable to GPT35-turbo on MBPP.
More evaluation details and reproducible code for the above results can be found in the [Evaluation](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation) directory.
### 6. License
This code repository is licensed under the MIT License. The use of the DeepSeek Coder model and weights is subject to the Model License. DeepSeek Coder supports commercial use.

Binary file not shown.

Before: 1.5 MiB | After: 194 KiB

BIN pictures/table.png (new file)

Binary file not shown.

After: 553 KiB