## 1. Introduction

We provide a test script to evaluate the performance of the **deepseek-coder** model on the code generation benchmark [**MBPP**](https://huggingface.co/datasets/mbpp) in a 3-shot setting.
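
MBPP pairs each natural-language task description with a reference solution and a list of test assertions; in the 3-shot setting, three solved examples precede the target task in the prompt. The sketch below shows one way to assemble such a prompt with the Hugging Face `datasets` library; the `[BEGIN]`/`[DONE]` template is illustrative and may differ from the template the test script actually uses.

```python
from datasets import load_dataset

# MBPP ships a dedicated "prompt" split meant to supply few-shot exemplars.
mbpp = load_dataset("mbpp")
shots = [mbpp["prompt"][i] for i in range(3)]

def format_example(ex, with_solution):
    tests = "\n".join(ex["test_list"])
    block = (
        "You are an expert Python programmer, and here is your task: "
        f"{ex['text']} Your code should pass these tests:\n\n{tests}\n[BEGIN]\n"
    )
    if with_solution:
        block += f"{ex['code']}\n[DONE]\n"
    return block

def build_3shot_prompt(problem):
    # Three solved exemplars followed by the unsolved target task.
    return "".join(format_example(s, True) for s in shots) + format_example(problem, False)

print(build_3shot_prompt(mbpp["test"][0]))
```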



## 2. Setup

```bash
pip install accelerate
pip install attrdict
pip install transformers
pip install torch
```
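
After installing the dependencies, you can sanity-check the environment by loading the model once with `transformers` (a quick check, not part of the evaluation pipeline; drop `.cuda()` to run on CPU):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Downloads the model and tokenizer from the Hugging Face Hub on first use.
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-1.3b-base",
    torch_dtype=torch.bfloat16,  # half precision is sufficient for inference
).cuda()
print(tokenizer("def fib(n):").input_ids)  # a few token ids confirm the setup works
```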



## 3. Evaluation

We've created a sample script, **eval.sh**, that demonstrates how to evaluate the **deepseek-coder-1.3b-base** model on the MBPP dataset using **8** GPUs.

```bash
MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir ${MODEL_NAME_OR_PATH} --language ${LANGUAGE} --dataroot ${DATASET_ROOT}
```

## 4. Experimental Results

We report experimental results here for several models. We set the maximum input length to **4096** tokens and the maximum output length to **500** tokens, and employ a **greedy search** decoding strategy.
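
Concretely, those decoding settings correspond to a `generate` call along the following lines (a single-prompt sketch; the evaluation script additionally batches prompts and shards work across GPUs via Accelerate):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-1.3b-base")
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-1.3b-base", torch_dtype=torch.bfloat16
).cuda()

prompt = "..."  # a 3-shot MBPP prompt, e.g. assembled as sketched in Section 1

# Cap the prompt at 4096 input tokens.
inputs = tokenizer(prompt, return_tensors="pt",
                   truncation=True, max_length=4096).to(model.device)

# Greedy search: do_sample=False takes the argmax token at every step,
# and the completion is limited to 500 newly generated tokens.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
```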



#### (1) Multilingual Base Models

| Model               | Size | Pass@1    |
|---------------------|------|-----------|
| CodeShell           | 7B   | 38.6%     |
| CodeGeeX2           | 6B   | 36.2%     |
| StarCoder           | 16B  | 42.8%     |
| CodeLlama-Base      | 7B   | 38.6%     |
| CodeLlama-Base      | 13B  | 47.0%     |
| CodeLlama-Base      | 34B  | 55.0%     |
|                     |      |           |
| DeepSeek-Coder-Base | 1.3B | 46.8%     |
| DeepSeek-Coder-Base | 5.7B | 57.2%     |
| DeepSeek-Coder-Base | 6.7B | 60.6%     |
| DeepSeek-Coder-Base | 33B  | **66.0%** |

#### (2) Instruction-Tuned Models
| Model                   | Size | Pass@1    |
|-------------------------|------|-----------|
| GPT-3.5-Turbo           | -    | 70.8%     |
| GPT-4                   | -    | **80.0%** |
|                         |      |           |
| DeepSeek-Coder-Instruct | 1.3B | 49.4%     |
| DeepSeek-Coder-Instruct | 6.7B | 65.4%     |
| DeepSeek-Coder-Instruct | 33B  | **70.0%** |
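
With greedy decoding there is exactly one completion per problem, so pass@1 reduces to the fraction of problems whose completion passes all of that problem's assertions. Below is a hypothetical sketch of that check; real harnesses execute candidates in a sandboxed subprocess with a timeout, so never `exec` untrusted model output like this in practice.

```python
def passes_tests(candidate_code, test_list):
    """Return True if the candidate passes every MBPP assertion.
    Illustration only: real harnesses sandbox execution and enforce timeouts."""
    env = {}
    try:
        exec(candidate_code, env)   # define the candidate function(s)
        for test in test_list:      # e.g. "assert fib(5) == 5"
            exec(test, env)
        return True
    except Exception:
        return False

def pass_at_1(completions, problems):
    # One greedy sample per problem: pass@1 = solved / total.
    solved = sum(passes_tests(c, p["test_list"])
                 for c, p in zip(completions, problems))
    return solved / len(problems)
```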