## 1. Introduction

We provide a test script to evaluate the performance of the **deepseek-coder** model on the [**MBPP**](https://huggingface.co/datasets/mbpp) code generation benchmark in a 3-shot setting.
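To make "3-shot" concrete, here is a minimal sketch of how three solved problems can be placed in front of the target problem. The field names `text`, `code`, and `test_list` follow the Hugging Face MBPP dataset. The prompt template, the helper names, and the toy problems are assumptions made for illustration and are not necessarily what `eval_pal.py` uses.

```python
# Illustrative only: this template is an assumption, not the format used by eval_pal.py.


def format_example(text, tests, code=None):
    """Render one problem; include the solution only for few-shot examples."""
    test_block = "\n".join(tests)
    prompt = f"Problem: {text}\nTests:\n{test_block}\nSolution:\n"
    if code is not None:
        prompt += f"{code}\n"
    return prompt


def build_few_shot_prompt(shots, target):
    """Concatenate the solved examples, then the unsolved target problem."""
    parts = [format_example(s["text"], s["test_list"], s["code"]) for s in shots]
    parts.append(format_example(target["text"], target["test_list"]))
    return "\n".join(parts)


# Toy problems written for this sketch (not real MBPP entries).
shots = [
    {"text": "Write a function to add two numbers.",
     "code": "def add(a, b):\n    return a + b",
     "test_list": ["assert add(1, 2) == 3"]},
    {"text": "Write a function to square a number.",
     "code": "def square(x):\n    return x * x",
     "test_list": ["assert square(3) == 9"]},
    {"text": "Write a function to reverse a string.",
     "code": "def reverse(s):\n    return s[::-1]",
     "test_list": ["assert reverse('ab') == 'ba'"]},
]
target = {"text": "Write a function to negate a number.",
          "test_list": ["assert negate(2) == -2"]}

prompt = build_few_shot_prompt(shots, target)
print(prompt)
```

The prompt ends with an open `Solution:` line, so the model's completion is read as the answer to the fourth problem.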


## 2. Setup

```
pip install accelerate
pip install attrdict
pip install transformers
pip install torch
```
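As an optional sanity check after installation, the following sketch reports which of the required packages cannot be found in the current environment. The helper name `missing_packages` is hypothetical. Note that PyTorch is imported as `torch`.

```python
import importlib.util

# Import names of the required packages; PyTorch is imported as `torch`.
REQUIRED = ["accelerate", "attrdict", "transformers", "torch"]


def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```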


## 3. Evaluation

We've created a sample script, **eval.sh**, that demonstrates how to test the **deepseek-coder-1.3b-base** model on the MBPP dataset using **8** GPUs.

```bash
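# Model to evaluate
MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
# Root directory of the dataset files
DATASET_ROOT="data/"
LANGUAGE="python"
# Launch eval_pal.py through accelerate with the settings in test_config.yaml
python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir ${MODEL_NAME_OR_PATH} --dataroot ${DATASET_ROOT}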
```

## 4. Experimental Results

We report experimental results here for several models. We set the maximum input length to **4096** and the maximum output length to **500**, and employ the **greedy search strategy**.
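With greedy decoding there is exactly one completion per problem, so Pass@1 reduces to the fraction of problems whose completion passes all of its tests. The sketch below illustrates that calculation under stated assumptions: the helper names and toy completions are hypothetical, and the code runs candidates with a bare `exec`, with no timeout or sandboxing, which a real harness would normally add.

```python
def passes_all_tests(code, tests):
    """Run a candidate solution against its assert-style tests in a fresh namespace."""
    namespace = {}
    try:
        exec(code, namespace)      # define the candidate function
        for test in tests:
            exec(test, namespace)  # each test is an `assert ...` statement
    except Exception:
        return False
    return True


def pass_at_1(results):
    """Fraction of problems whose single (greedy) completion passed."""
    return sum(results) / len(results) if results else 0.0


# Toy completions written for this sketch (not model output).
completions = [
    ("def add(a, b):\n    return a + b", ["assert add(1, 2) == 3"]),
    ("def square(x):\n    return x + x", ["assert square(3) == 9"]),  # wrong on purpose
]

results = [passes_all_tests(code, tests) for code, tests in completions]
score = pass_at_1(results)
print(f"Pass@1: {score:.1%}")
```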


#### (1) Multilingual Base Models

| Model               | Size | Pass@1    |
|---------------------|------|-----------|
| CodeShell           | 7B   | 38.6%     |
| CodeGeeX2           | 6B   | 36.2%     |
| StarCoder           | 16B  | 42.8%     |
| CodeLLama-Base      | 7B   | 38.6%     |
| CodeLLama-Base      | 13B  | 47.0%     |
| CodeLLama-Base      | 34B  | 55.0%     |
|                     |      |           |
| DeepSeek-Coder-Base | 1.3B | 46.8%     |
| DeepSeek-Coder-Base | 5.7B | 57.2%     |
| DeepSeek-Coder-Base | 6.7B | 60.6%     |
| DeepSeek-Coder-Base | 33B  | **66.0%** |

#### (2) Instruction-Tuned Models

| Model                   | Size | Pass@1    |
|-------------------------|------|-----------|
| GPT-3.5-Turbo           | -    | 70.8%     |
| GPT-4                   | -    | **80.0%** |
|                         |      |           |
| DeepSeek-Coder-Instruct | 1.3B | 49.4%     |
| DeepSeek-Coder-Instruct | 6.7B | 65.4%     |
| DeepSeek-Coder-Instruct | 33B  | **70.0%** |