update evaluation

This commit is contained in:
Dejian 2023-11-01 15:04:54 +08:00
parent aaa5c54e6c
commit 4d854b8955
4 changed files with 20 additions and 11 deletions

View File

@@ -1,6 +1,6 @@
## 1. Introduction
We provide a test script to evaluate the performance of the **deepseek-coder** model on code generation benchmarks. We select the widely-used benchmarks: **[HumanEval](https://huggingface.co/datasets/openai_humaneval), [MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E)**.
We provide a test script to evaluate the performance of the **deepseek-coder** model on code generation benchmarks. We select the widely-used benchmarks: **[HumanEval-Python](https://huggingface.co/datasets/openai_humaneval), [HumanEval-Multilingual](https://huggingface.co/datasets/nuprl/MultiPL-E)**.
@@ -14,7 +14,6 @@ pip install pytorch
```
## 3. Evaluation
We've created a sample script, **eval.sh**, that demonstrates how to test the **deepseek-coder-1b-python** model on the HumanEval dataset leveraging **8** GPUs. If your use case involves a different model or dataset, simply adjust the script to fit your needs.
@@ -35,7 +34,6 @@ python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py
We report experimental results here for 8 mainstream programming languages: **Python**, **C++**, **Java**, **PHP**, **TypeScript**, **C#**, **Bash**, and **JavaScript**. For all open-source models, we use this repository to obtain their performance on the HumanEval dataset. We set the maximum input length to **4096** and the maximum output length to **500**, and employ the **greedy search strategy**.
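For context, the snippet below is a minimal sketch of this decoding setup (inputs truncated to 4096 tokens, up to 500 generated tokens, greedy search) using the Hugging Face `transformers` API; the checkpoint id and prompt are illustrative placeholders, not the exact configuration used by `eval_pal.py`.
```python
# Minimal sketch of the decoding setup described above; the checkpoint id and
# prompt are illustrative, not the exact evaluation configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/deepseek-coder-1.3b-base"  # illustrative checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda()

prompt = 'def fib(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt",
                   truncation=True, max_length=4096).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=500, do_sample=False)  # greedy search
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(completion)
```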
#### (1) Multilingual Base Models
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg |
@@ -55,13 +53,12 @@ We report experimental results here for 8 main-stream programming languages, **p
#### (3) Instruction-Tuned Models
| Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg |
|---------------------|------|--------|-------|------|------|------|------|------|------|------|
| ChatGPT | - | 70.7% | 50.3% | 54.5%| 52.2%| 62.3%| 64.6%| 34.8%| 60.9%| 52.2%|
| GPT-4 | - | 82.3% | 70.2% | 74.8%| 70.8%| 73.0%| 77.9%| 51.3%| 83.2%| 72.9%|
| GPT35-turbo | - | 76.2% | 63.4% | 69.2%| 60.9%| 69.1%| 70.8%| 42.4%| 67.1%| 64.9%|
| GPT-4 | - | 84.1% | 76.4% | 81.6%| 77.2%| 77.4%| 79.1%| 58.2%| 78.0%| 76.5%|
| WizardCoder | 16B | 51.8% | 41.6% | 41.1%| 42.2%| 44.7%| 46.8%| 12.7%| 42.8%| 40.5%|
| Phind-CodeLlama | 34B | - | - | - | - | - | - | - | - | - |
| | | | | | | | | | | |
| OraCoder-Chat (1B) | 1B | - | - | - | - | - | - | - | - | - |
| OraCoder-Chat (7B) | 7B | - | - | - | - | - | - | - | - | - |
| OraCoder-Chat (33B) | 33B | - | - | - | - | - | - | - | - | - |
| OraCoder-Chat (7B) | 7B | 78.9% | 63.4% | 68.4% | 68.9%| 67.2%| 72.8%| 36.7%| 72.7%| 66.1%|
| OraCoder-Chat (33B) | 33B | 79.3% | 68.9% | 73.4% | 72.7%| 67.9%| 74.1%| 43.0%| 73.9%| 69.2%|

View File

@@ -9,7 +9,9 @@
Deepseek Coder comprises a series of code language models trained on 87% code and 13% natural language in both English and Chinese, with each model pre-trained on 2T tokens. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on a project-level code corpus with a window size of 16K and an extra fill-in-the-blank task to support project-level code completion and infilling. For coding capabilities, Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.
<img src="pictures/result.png" alt="result" width="85%">
<p align="center">
<img src="pictures/result.png" alt="result" width="80%">
</p>
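Because the extra fill-in-the-blank objective is what enables infilling, a minimal sketch of how an infilling prompt can be assembled is shown below; the sentinel token strings are hypothetical placeholders, and the actual special tokens are defined by the released tokenizer.
```python
# Hypothetical fill-in-the-middle (FIM) prompt assembly; the sentinel strings
# below are placeholders, not the model's actual special tokens.
PREFIX_TOK, HOLE_TOK, SUFFIX_TOK = "<fim_begin>", "<fim_hole>", "<fim_end>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Ask the model to generate the code that belongs between prefix and suffix."""
    return f"{PREFIX_TOK}{prefix}{HOLE_TOK}{suffix}{SUFFIX_TOK}"

prompt = build_fim_prompt(
    prefix="def quick_sort(arr):\n    if len(arr) <= 1:\n        return arr\n",
    suffix="\n    return quick_sort(left) + [pivot] + quick_sort(right)\n",
)
```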
- **Massive Training Data**: Trained on 2T tokens, including 87% code and 13% natural language data in both English and Chinese.
@@ -29,7 +31,8 @@ Deepseek Coder comprises a series of code language models trained on both 87% co
- Step 2: Parsing the dependencies of files within the same repository to rearrange the file positions based on their dependencies.
- Step 3: Concatenating dependent files to form a single example and employing repo-level MinHash for deduplication (a minimal deduplication sketch follows the pipeline figure below).
- Step 4: Further filtering out low-quality code, such as code with syntax errors or poor readability.
- <img src="pictures/data_clean.png" alt="data_creation" width="100%">
<img src="pictures/data_clean.png" alt="data_creation" width="100%">
#### Model Training
@@ -203,9 +206,18 @@ print(tokenizer.decode(outputs[0]))
```
### 5. Evaluation Results
We evaluate DeepSeek Coder on various coding-related benchmarks.
The `pass@1` results on HumanEval (Python and Multilingual), MBPP, and DS-1000 are reported as follows:
The reproducible code for the following evaluation results can be found in the [Evaluation](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation) directory.
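For context, `pass@1` refers to the standard unbiased pass@k estimator introduced with the HumanEval benchmark, evaluated at k=1; with greedy decoding and a single sample per problem it reduces to the fraction of problems solved. A compact reference implementation is sketched below, and it is not necessarily the exact code used in the Evaluation directory.
```python
# Unbiased pass@k estimator; with one greedy sample per problem, pass@1 is
# simply the fraction of problems whose completion passes all unit tests.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples per problem, c: correct samples, k: evaluation budget."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=10, k=1))  # ≈ 0.05
```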
<p align="center">
<img src="pictures/table.png" alt="table" width="85%">
</p>
The results show that DeepSeek-Coder-Base-33B significantly outperforms existing open-source code LLMs. Compared with CodeLlama-34B, it leads by 7.9%, 9.3%, 10.8%, and 5.9% on HumanEval Python, HumanEval Multilingual, MBPP, and DS-1000, respectively.
Surprisingly, our DeepSeek-Coder-Base-7B reaches the performance of CodeLlama-34B.
After instruction tuning, the DeepSeek-Coder-Instruct-33B model outperforms GPT35-turbo on HumanEval and achieves results comparable to GPT35-turbo on MBPP.
More evaluation details and reproducible code for the above results can be found in the [Evaluation](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation) directory.
### 6. License
This code repository is licensed under the MIT License. The use of the DeepSeek Coder model and weights is subject to the Model License. DeepSeek Coder supports commercial use.

Binary file not shown.

Before: 1.5 MiB | After: 194 KiB

BIN pictures/table.png (new file)

Binary file not shown.

After: 553 KiB