## 1. Introduction

We provide a test script to evaluate the performance of the **deepseek-coder** model on the code generation benchmark [**MBPP**](https://huggingface.co/datasets/mbpp) in a 3-shot setting.

## 2. Setup

```
pip install accelerate
pip install attrdict
pip install transformers
pip install torch
```

## 3. Evaluation

We've created a sample script, **eval.sh**, that demonstrates how to test the **deepseek-coder-1.3b-base** model on the MBPP dataset using **8** GPUs.

```bash
MODEL_NAME_OR_PATH="deepseek-ai/deepseek-coder-1.3b-base"
DATASET_ROOT="data/"
LANGUAGE="python"
python -m accelerate.commands.launch --config_file test_config.yaml eval_pal.py --logdir ${MODEL_NAME_OR_PATH} --dataroot ${DATASET_ROOT}
```

## 4. Experimental Results

We report experimental results for several models below. We set the maximum input length to **4096** tokens and the maximum output length to **500** tokens, and use the **greedy search strategy** for decoding.

#### (1) Multilingual Base Models

| Model               | Size | Pass@1    |
|---------------------|------|-----------|
| CodeShell           | 7B   | 38.6%     |
| CodeGeeX2           | 6B   | 36.2%     |
| StarCoder           | 16B  | 42.8%     |
| CodeLLama-Base      | 7B   | 38.6%     |
| CodeLLama-Base      | 13B  | 47.0%     |
| CodeLLama-Base      | 34B  | 55.0%     |
|                     |      |           |
| DeepSeek-Coder-Base | 1.3B | 46.8%     |
| DeepSeek-Coder-Base | 5.7B | 57.2%     |
| DeepSeek-Coder-Base | 6.7B | 60.6%     |
| DeepSeek-Coder-Base | 33B  | **66.0%** |

#### (2) Instruction-Tuned Models

| Model                   | Size | Pass@1    |
|-------------------------|------|-----------|
| GPT-3.5-Turbo           | -    | 70.8%     |
| GPT-4                   | -    | **80.0%** |
|                         |      |           |
| DeepSeek-Coder-Instruct | 1.3B | 49.4%     |
| DeepSeek-Coder-Instruct | 6.7B | 65.4%     |
| DeepSeek-Coder-Instruct | 33B  | **70.0%** |
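
For reference, the sketch below illustrates what a single-GPU, 3-shot greedy evaluation pass over MBPP can look like. It is **not** `eval_pal.py`: the prompt template (the `[BEGIN]`/`[DONE]` markers) and the helper names are illustrative assumptions; only the model name, the dataset, and the 3-shot / greedy / 500-token settings come from this README.

```python
# Minimal single-GPU sketch of 3-shot greedy generation on MBPP.
# NOTE: this is an illustration, not the repo's eval_pal.py; the prompt
# template below is an assumption.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/deepseek-coder-1.3b-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

mbpp = load_dataset("mbpp", "full")          # splits: train / test / validation / prompt
few_shot = mbpp["prompt"].select(range(3))   # 3 exemplars for the few-shot prefix


def format_example(ex, with_solution=True):
    """Render one MBPP task as a prompt block (assumed template)."""
    tests = "\n".join(ex["test_list"])
    block = f"You are an expert Python programmer, and here is your task: {ex['text']}\n"
    block += f"Your code should pass these tests:\n{tests}\n[BEGIN]\n"
    if with_solution:
        block += f"{ex['code']}\n[DONE]\n"
    return block


def generate_solution(task):
    prompt = "".join(format_example(ex) for ex in few_shot)
    prompt += format_example(task, with_solution=False)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=500,                  # maximum output length from the README
        do_sample=False,                     # greedy search
        pad_token_id=tokenizer.eos_token_id,
    )
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return completion.split("[DONE]")[0].strip()   # keep only the generated code


print(generate_solution(mbpp["test"][0]))
```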
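
With greedy decoding there is only one completion per task, so pass@1 reduces to the fraction of problems whose completion passes all of MBPP's unit tests. A minimal sketch of that scoring step is shown below; real harnesses execute candidates in a sandboxed subprocess with a timeout, which is omitted here for brevity.

```python
# Sketch of pass@1 scoring under greedy decoding: one completion per task,
# counted as solved only if every assert in MBPP's test_list passes.
# WARNING: executes model-generated code directly; a real harness sandboxes this.
def passes_tests(candidate_code: str, test_list: list) -> bool:
    env = {}
    try:
        exec(candidate_code, env)      # define the candidate function(s)
        for test in test_list:         # e.g. "assert min_cost(...) == 8"
            exec(test, env)
        return True
    except Exception:
        return False


def pass_at_1(completions, tasks):
    solved = sum(passes_tests(c, t["test_list"]) for c, t in zip(completions, tasks))
    return solved / len(tasks)
```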