update evaluation

parent aaa5c54e6c
commit 4d854b8955
@@ -1,6 +1,6 @@
## 1. Introduction

-We provide a test script to evaluate the performance of the **deepseek-coder** model on code generation benchmarks. We select the widely used benchmarks **[HumanEval](https://huggingface.co/datasets/openai_humaneval)** and **[MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E)**.
+We provide a test script to evaluate the performance of the **deepseek-coder** model on code generation benchmarks. We select the widely used benchmarks **[HumanEval-Python](https://huggingface.co/datasets/openai_humaneval)** and **[HumanEval-Multilingual](https://huggingface.co/datasets/nuprl/MultiPL-E)**.
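Both benchmarks are published on the Hugging Face Hub. As a rough orientation (not part of the test script itself), they can be loaded as follows; the MultiPL-E config name is an assumption to adapt per target language.

```python
from datasets import load_dataset

# Hedged sketch of loading the two benchmarks linked above. The MultiPL-E
# config "humaneval-cpp" is an illustrative choice; there is one config per
# target language.
humaneval = load_dataset("openai_humaneval", split="test")
multipl_e = load_dataset("nuprl/MultiPL-E", "humaneval-cpp", split="test")

print(humaneval[0]["task_id"], humaneval[0]["prompt"][:60])
```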
@@ -14,7 +14,6 @@ pip install pytorch
```
## 3. Evaluation
We've created a sample script, **eval.sh**, that demonstrates how to test the **deepseek-coder-1b-python** model on the HumanEval dataset using **8** GPUs. If your use case involves a different model or dataset, simply adjust the script to fit your needs.
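For orientation, the loop such a script ultimately drives looks roughly like the following single-process sketch, built on the `human-eval` package's data helpers; the model id and decoding settings here are placeholders, not the script's exact contents (the real script handles multi-GPU launching, batching, and scoring).

```python
# Minimal sketch of a HumanEval generation pass, assuming a Hugging Face
# causal LM. Requires: pip install human-eval transformers torch
import torch
from human_eval.data import read_problems, write_jsonl
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/deepseek-coder-1.3b-base"  # placeholder; substitute the model under test

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

samples = []
for task_id, problem in read_problems().items():
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=500, do_sample=False)
    # Keep only the newly generated tokens after the prompt.
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    samples.append(dict(task_id=task_id, completion=completion))

# Functional-correctness scoring is a separate step, e.g. with human-eval's
# `evaluate_functional_correctness samples.jsonl`.
write_jsonl("samples.jsonl", samples)
```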
@@ -35,7 +34,6 @@ python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py
We report experimental results here for 8 mainstream programming languages: **Python**, **C++**, **Java**, **PHP**, **TypeScript**, **C#**, **Bash**, and **JavaScript**. For all open-source models, we utilize this repository to obtain their performance on the HumanEval dataset. We set the maximum input length to **4096** and the maximum output length to **500**, and employ the **greedy search strategy**.
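In `transformers` terms, the stated settings amount to something like the following sketch; the object names and the input-side truncation are assumptions, not the repository's exact configuration.

```python
from transformers import GenerationConfig

# Hedged encoding of the reported decoding setup.
generation_config = GenerationConfig(
    max_new_tokens=500,  # maximum output length
    do_sample=False,     # greedy search strategy
    num_beams=1,
)

# The 4096-token maximum input length would apply on the tokenizer side, e.g.:
# tokenizer(prompt, truncation=True, max_length=4096, return_tensors="pt")
```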
#### (1) Multilingual Base Models
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg |
@@ -55,13 +53,12 @@ We report experimental results here for 8 mainstream programming languages, **p
#### (3) Instruction-Tuned Models
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg |
|---------------------|------|--------|-------|------|------|------|------|------|------|------|
-| ChatGPT | - | 70.7% | 50.3% | 54.5%| 52.2%| 62.3%| 64.6%| 34.8%| 60.9%| 52.2%|
-| GPT-4 | - | 82.3% | 70.2% | 74.8%| 70.8%| 73.0%| 77.9%| 51.3%| 83.2%| 72.9%|
+| GPT-3.5-Turbo | - | 76.2% | 63.4% | 69.2%| 60.9%| 69.1%| 70.8%| 42.4%| 67.1%| 64.9%|
+| GPT-4 | - | 84.1% | 76.4% | 81.6%| 77.2%| 77.4%| 79.1%| 58.2%| 78.0%| 76.5%|
| WizardCoder | 16B | 51.8% | 41.6% | 41.1%| 42.2%| 44.7%| 46.8%| 12.7%| 42.8%| 40.5%|
| Phind-CodeLlama | 34B | - | - | - | - | - | - | - | - | - |
| | | | | | | | | | | |
-| OraCoder-Chat (1B) | 1B | - | - | - | - | - | - | - | - | - |
-| OraCoder-Chat (7B) | 7B | - | - | - | - | - | - | - | - | - |
-| OraCoder-Chat (33B) | 33B | - | - | - | - | - | - | - | - | - |
+| OraCoder-Chat (7B) | 7B | 78.9% | 63.4% | 68.4% | 68.9%| 67.2%| 72.8%| 36.7%| 72.7%| 66.1%|
+| OraCoder-Chat (33B) | 33B | 79.3% | 68.9% | 73.4% | 72.7%| 67.9%| 74.1%| 43.0%| 73.9%| 69.2%|
README.md (18 lines changed)
@@ -9,7 +9,9 @@
Deepseek Coder comprises a series of code language models trained on 87% code and 13% natural language in English and Chinese, with each model pre-trained on 2T tokens. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling. For coding capabilities, Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.
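As a rough illustration of the fill-in-the-blank objective mentioned here (often called fill-in-the-middle), a training document can be rearranged so that the model predicts a missing span from its surrounding context. The sentinel strings below are placeholders, not the model's actual special tokens.

```python
import random

# Sketch of fill-in-the-middle (FIM) example construction. The sentinel
# strings are illustrative; real tokenizers define their own special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def make_fim_example(document: str, rng: random.Random) -> str:
    """Split a document into prefix/middle/suffix and rearrange it so the
    middle span becomes the prediction target at the end of the sequence."""
    a, b = sorted(rng.sample(range(len(document)), 2))
    prefix, middle, suffix = document[:a], document[a:b], document[b:]
    # PSM ordering: prefix, then suffix, then the middle the model must fill in.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

print(make_fim_example("def add(x, y):\n    return x + y\n", random.Random(0)))
```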
-<img src="pictures/result.png" alt="result" width="85%">
+<p align="center">
+<img src="pictures/result.png" alt="result" width="80%">
+</p>
- **Massive Training Data**: Trained on 2T tokens, including 87% code and 13% natural language data in English and Chinese.
@@ -29,7 +31,8 @@ Deepseek Coder comprises a series of code language models trained on both 87% co
- Step 2: Parsing the dependencies between files within the same repository and rearranging the file positions accordingly.
- Step 3: Concatenating dependent files to form a single example and employing repo-level minhash for deduplication (a minimal deduplication sketch follows the figure below).
- Step 4: Further filtering out low-quality code, such as code with syntax errors or poor readability.
-- <img src="pictures/data_clean.png" alt="data_creation" width="100%">
+<img src="pictures/data_clean.png" alt="data_creation" width="100%">
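As referenced in Step 3, the following is a minimal sketch of near-duplicate detection with MinHash LSH using the `datasketch` library; the shingle size and Jaccard threshold are illustrative assumptions, not the pipeline's actual parameters.

```python
from datasketch import MinHash, MinHashLSH

def repo_signature(text: str, num_perm: int = 128) -> MinHash:
    """MinHash over 5-gram token shingles of a concatenated repository."""
    m = MinHash(num_perm=num_perm)
    tokens = text.split()
    for i in range(max(len(tokens) - 4, 1)):
        m.update(" ".join(tokens[i:i + 5]).encode("utf-8"))
    return m

# Jaccard threshold of 0.85 is an illustrative choice.
lsh = MinHashLSH(threshold=0.85, num_perm=128)
kept = {}
repos = {
    "repo_a": "def add(x, y): return x + y",
    "repo_b": "def add(x, y): return x + y",  # near-duplicate of repo_a
}
for name, text in repos.items():
    sig = repo_signature(text)
    if lsh.query(sig):  # collides with an already-kept repository
        continue
    lsh.insert(name, sig)
    kept[name] = text

print(sorted(kept))  # -> ['repo_a']
```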
#### Model Training
@@ -203,9 +206,18 @@ print(tokenizer.decode(outputs[0]))
```
### 5. Evaluation Results
We evaluate DeepSeek Coder on various coding-related benchmarks.
The `pass@1` results on HumanEval (Python and Multilingual), MBPP, and DS-1000 are reported as follows:
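For reference, `pass@k` here denotes the standard unbiased estimator from the HumanEval paper (Chen et al., 2021); a minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that at
    least one of k samples drawn from n generations (c of them correct)
    passes. With greedy decoding, pass@1 reduces to c/n with n = 1."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=43, k=1))  # 0.215, i.e. exactly c/n when k = 1
```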
-The reproducible code for the following evaluation results can be found in the [Evaluation](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation) directory.
+<p align="center">
+<img src="pictures/table.png" alt="table" width="85%">
+</p>
+The results show that DeepSeek-Coder-Base-33B significantly outperforms existing open-source code LLMs. Compared with CodeLlama-34B, it leads by 7.9%, 9.3%, 10.8%, and 5.9% on HumanEval-Python, HumanEval-Multilingual, MBPP, and DS-1000, respectively.
+Surprisingly, our DeepSeek-Coder-Base-7B reaches the performance of CodeLlama-34B.
+After instruction tuning, DeepSeek-Coder-Instruct-33B outperforms GPT-3.5-Turbo on HumanEval and achieves comparable results on MBPP.
+More evaluation details and reproducible code for the above results can be found in the [Evaluation](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation) directory.
### 6. License
This code repository is licensed under the MIT License. The use of the DeepSeek Coder models and weights is subject to the Model License. DeepSeek Coder supports commercial use.
Binary file not shown (before: 1.5 MiB; after: 194 KiB).
pictures/table.png (new binary file, not shown; 553 KiB)