diff --git a/Evaluation/HumanEval/README.md b/Evaluation/HumanEval/README.md
index 1c8888a..cd91793 100644
--- a/Evaluation/HumanEval/README.md
+++ b/Evaluation/HumanEval/README.md
@@ -1,6 +1,6 @@
 ## 1. Introduction
 
-We provide a test script to evaluate the performance of the **deepseek-coder** model on code generation benchmarks. We select the widely-used benchmarks: **[HumanEval](https://huggingface.co/datasets/openai_humaneval), [MultiPL-E](https://huggingface.co/datasets/nuprl/MultiPL-E)**.
+We provide a test script to evaluate the performance of the **deepseek-coder** model on code generation benchmarks. We select the widely-used benchmarks: **[HumanEval-Python](https://huggingface.co/datasets/openai_humaneval), [HumanEval-Multilingual](https://huggingface.co/datasets/nuprl/MultiPL-E)**.
 
@@ -14,7 +14,6 @@
 pip install pytorch
 ```
 
-
 ## 3. Evaluation
 
 We've created a sample script, **eval.sh**, that demonstrates how to test the **deepseek-coder-1b-python** model on the HumanEval dataset leveraging **8** GPUs. If your use case involves a different model or dataset, simply adjust the script to fit your needs.
 
@@ -35,7 +34,6 @@ python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py
 
 We report experimental results here for 8 main-stream programming languages, **python**, **c++**, **java**, **PHP**, **TypeScript**, **C#**, **Bash**, and **JavaScript**. For all open-source models, we utilize this repository to obtain the performance of the models on the HumanEval dataset. We set the maximum input length to **4096** and the maximum output length to **500**, and employ the **greedy search strategy**.
-
 
 #### (1) Multilingual Base Models
 | Model | Size | Python | C++ | Java | PHP | TS | C# | Bash | JS | Avg |
@@ -55,13 +53,12 @@ We report experimental results here for 8 main-stream programming languages, **p
 #### (3) Instruction-Tuned Models
 | Model               | Size | Python | C++   | Java  | PHP   | TS    | C#    | Bash  | JS    | Avg   |
 |---------------------|------|--------|-------|-------|-------|-------|-------|-------|-------|-------|
-| ChatGPT             | -    | 70.7%  | 50.3% | 54.5% | 52.2% | 62.3% | 64.6% | 34.8% | 60.9% | 52.2% |
-| GPT-4               | -    | 82.3%  | 70.2% | 74.8% | 70.8% | 73.0% | 77.9% | 51.3% | 83.2% | 72.9% |
+| GPT35-turbo         | -    | 76.2%  | 63.4% | 69.2% | 60.9% | 69.1% | 70.8% | 42.4% | 67.1% | 64.9% |
+| GPT-4               | -    | 84.1%  | 76.4% | 81.6% | 77.2% | 77.4% | 79.1% | 58.2% | 78.0% | 76.5% |
 | WizardCoder         | 16B  | 51.8%  | 41.6% | 41.1% | 42.2% | 44.7% | 46.8% | 12.7% | 42.8% | 40.5% |
 | Phind-CodeLlama     | 34B  | -      | -     | -     | -     | -     | -     | -     | -     | -     |
 |                     |      |        |       |       |       |       |       |       |       |       |
 | OraCoder-Chat (1B)  | 1B   | -      | -     | -     | -     | -     | -     | -     | -     | -     |
-| OraCoder-Chat (7B)  | 7B   | -      | -     | -     | -     | -     | -     | -     | -     | -     |
-| OraCoder-Chat (33B) | 33B  | -      | -     | -     | -     | -     | -     | -     | -     | -     |
-
+| OraCoder-Chat (7B)  | 7B   | 78.9%  | 63.4% | 68.4% | 68.9% | 67.2% | 72.8% | 36.7% | 72.7% | 66.1% |
+| OraCoder-Chat (33B) | 33B  | 79.3%  | 68.9% | 73.4% | 72.7% | 67.9% | 74.1% | 43.0% | 73.9% | 69.2% |
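For reference (not taken from the patch or from `eval_pal.py`), here is a minimal sketch of generating one completion under the settings quoted above, a 4096-token input limit, up to 500 new tokens, and greedy search, using the Hugging Face `transformers` API. The checkpoint name is a placeholder.

```python
# Illustrative only: not the repository's eval_pal.py. Assumes a CUDA GPU and a
# placeholder checkpoint id; substitute the model you actually want to evaluate.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).cuda()

prompt = 'def fib(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096).to(model.device)

# Greedy search: do_sample=False; cap the completion at 500 new tokens.
outputs = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=500,
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```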
diff --git a/README.md b/README.md
index f2ccc9b..15645de 100644
--- a/README.md
+++ b/README.md
@@ -9,7 +9,9 @@
 Deepseek Coder comprises a series of code language models trained on both 87% code and 13% natural language in English and Chinese, with each model pre-trained on 2T tokens. We provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on project-level code corpus by employing a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling.
 For coding capabilities, Deepseek Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.
-![result](pictures/result.png)
+
+![result](pictures/result.png)
+
 
 - **Massive Training Data**: Trained on 2T tokens, including 87% code and 13% linguistic data in both English and Chinese languages.
@@ -29,7 +31,8 @@ Deepseek Coder comprises a series of code language models trained on both 87% co
 - Step 2: Parsing the dependencies of files within the same repository to rearrange the file positions based on their dependencies.
 - Step 3: Concatenating dependent files to form a single example and employ repo-level minhash for deduplication.
 - Step 4: Further filtering out low-quality code, such as codes with syntax errors or poor readability.
-- data_creation
+
+data_creation
 
 #### Model Training
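Step 3 in the hunk above mentions repo-level minhash deduplication. As a generic illustration only (the patch does not show the actual data pipeline), near-duplicate examples are commonly filtered with MinHash signatures plus locality-sensitive hashing, for instance via the `datasketch` library:

```python
# Generic near-duplicate filtering with MinHash/LSH (illustrative only; not the
# repository's actual pipeline). Requires `pip install datasketch`.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):  # crude whitespace shingling
        m.update(token.encode("utf-8"))
    return m

examples = {
    "repo_a": "def add(a, b):\n    return a + b\n",
    "repo_b": "def add(a, b):\n    return a + b\n",  # exact duplicate of repo_a
    "repo_c": "class Stack:\n    def __init__(self):\n        self.items = []\n",
}

lsh = MinHashLSH(threshold=0.85, num_perm=128)  # Jaccard threshold for "duplicate"
kept = []
for key, text in examples.items():
    mh = minhash_of(text)
    if lsh.query(mh):      # a near-duplicate is already kept, so drop this one
        continue
    lsh.insert(key, mh)
    kept.append(key)

print(kept)  # ['repo_a', 'repo_c']
```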
@@ -203,9 +206,18 @@ print(tokenizer.decode(outputs[0]))
 ```
 
 ### 5. Evaluation Results
+We evaluate DeepSeek Coder on various coding-related benchmarks.
+The `pass@1` results on HumanEval (Python and Multilingual), MBPP, and DS-1000 are reported as follows:
-The reproducible code for the following evaluation results can be found in the [Evaluation](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation) directory.
+
+![table](pictures/table.png)
+
+The results show that DeepSeek-Coder-Base-33B significantly outperforms existing open-source code LLMs. Compared with CodeLlama-34B, it leads by 7.9%, 9.3%, 10.8% and 5.9% respectively on HumanEval Python, HumanEval Multilingual, MBPP and DS-1000.
+Surprisingly, our DeepSeek-Coder-Base-7B reaches the performance of CodeLlama-34B.
+After instruction tuning, the DeepSeek-Coder-Instruct-33B model outperforms GPT35-turbo on HumanEval and achieves results comparable to GPT35-turbo on MBPP.
+
+More evaluation details and reproducible code for the above results can be found in the [Evaluation](https://github.com/deepseek-ai/deepseek-coder/tree/main/Evaluation) directory.
 
 ### 6. License
 This code repository is licensed under the MIT License. The use of DeepSeek Coder model and weights is subject to the Model License. DeepSeek Coder supports commercial use.
diff --git a/pictures/result.png b/pictures/result.png
index 75bf7dd..366c7bc 100644
Binary files a/pictures/result.png and b/pictures/result.png differ
diff --git a/pictures/table.png b/pictures/table.png
new file mode 100644
index 0000000..1c9d6f3
Binary files /dev/null and b/pictures/table.png differ
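The added text above reports `pass@1` numbers. For reference, the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) is sketched below; with a single greedy sample per problem it reduces to the plain fraction of problems whose completion passes the unit tests.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Aggregate by averaging the per-problem estimates.
per_problem = [(1, 1), (1, 0), (1, 1), (1, 1)]  # (n, c) pairs; greedy decoding => n = 1
print(np.mean([pass_at_k(n, c, k=1) for n, c in per_problem]))  # 0.75
```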