From eaa34f787c6bf8f977aaf5b43d0f0543ee9bd3c7 Mon Sep 17 00:00:00 2001
From: Daya Guo <40300434+guoday@users.noreply.github.com>
Date: Wed, 1 Nov 2023 10:48:55 +0800
Subject: [PATCH] Update README.md

---
 Evaluation/PAL-Math/README.md | 32 +++++---------------------------
 1 file changed, 5 insertions(+), 27 deletions(-)

diff --git a/Evaluation/PAL-Math/README.md b/Evaluation/PAL-Math/README.md
index 85fcd9e..0b32cd3 100644
--- a/Evaluation/PAL-Math/README.md
+++ b/Evaluation/PAL-Math/README.md
@@ -17,7 +17,7 @@ pip install sympy==1.12 pebble timeout-decorator transformers
 We provide an example of testing the **deepseek-coder-1b-python** model on the **gsm8k** dataset using **8** GPUs. If you wish to use a different model or dataset, you can modify it as needed.
 
 ```bash
-MODEL_NAME_OR_PATH=deepseek/deepseek-coder-1b-python
+MODEL_NAME_OR_PATH=deepseek-ai/deepseek-coder-7b-base
 DATA=gsm8k # 'math' 'gsm8k' 'gsm-hard' 'svamp' 'tabmwp' 'asdiv' 'mawps'
 MODEL_DIR_NAME=${MODEL_NAME_OR_PATH##*/}
 GPU_NUM=8
@@ -45,7 +45,6 @@ We report experimental results here for mathematical reasoning tasks by using py
 
 
 
-#### (1) Multilingual Base Models
 
 | Model | Size | GSM8k | MATH | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg |
 | -------------- | ---- | ----- | ----- | -------- | ----- | ------ | ----- | ----- | ----- |
@@ -56,30 +55,9 @@ We report experimental results here for mathematical reasoning tasks by using py
 | CodeLLama-Base | 13B | 44.2% | 15.5% | 42.4% | 65.6% | 61.6% | 65.3% | 85.3% | 54.3% |
 | CodeLLama-Base | 34B | 58.2% | 22.1% | 55.2% | 77.2% | 69.6% | 70.0% | 92.8% | 63.6% |
 | | | | | | | | | | |
-| OraCoder-Base | 1B | 17.0% | 13.4% | 13.3% | 39.2% | 42.4% | 44.8% | 66.0% | 33.7% |
-| OraCoder-Base | 7B | 46.0% | 20.6% | 40.0% | 67.2% | 71.2% | 67.1% | 89.1% | 57.3% |
-| OraCoder-Base | 33B | - | - | - | - | - | - | - | - |
+| DeepSeek-Coder-Base | 1.3B | 15.8% | 16.3% | 14.5% | 38.4% | 28.8% | 51.3% | 66.0% | 33.0% |
+| DeepSeek-Coder-MQA-Base | 5.7B | 44.8% | 25.4% | 40.6% | 56.8% | 62.4% | 66.8% | 84.2% | 54.4% |
+| DeepSeek-Coder-Base | 6.7B | 46.1% | 25.6% | 40.0% | 67.2% | 71.2% | 69.0% | 89.2% | 58.3% |
+| DeepSeek-Coder-Base | 33B | 58.2% | 35.3% | 54.5% | 78.4% | 76.8% | 78.2% | 94.0% | 67.9% |
 
 
-#### (2) Python Base Models
-
-| Model | Size | GSM8k | MATH | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg |
-| -------------- | ---- | ----- | ----- | -------- | ----- | ------ | ----- | ----- | ----- |
-| StarCoder | 16B | 31.5% | 13.8% | 26.7% | 48.8% | 47.2% | 54.9% | 76.1% | 42.7% |
-| CodeLLama-Python | 7B | 35.2% | 14.7% | 34.5% | 70.4% | 55.2% | 62.1% | 84.2% | 50.9% |
-| CodeLLama-Python | 13B | 44.8% | 17.4% | 45.5% | 65.6% | 60.8% | 69.0% | 89.6% | 56.1% |
-| CodeLLama-Python | 34B | 57.6% | 21.1% | 54.5% | 76.8% | 66.8% | 69.5% | 94.2% | 62.9% |
-| | | | | | | | | | |
-| OraCoder-Python | 1B | 17.6% | 15.0% | 18.2% | 40.0% | 38.4% | 49.5% | 64.1% | 34.7% |
-| OraCoder-Python | 7B | 50.3% | 24.3% | 43.0% | 71.2% | 73.6% | 69.7% | 88.0% | 60.0% |
-| OraCoder-Python | 33B | - | - | - | - | - | - | - | - |
-
-#### (3) Instruction-Tuned Models
-| Model | Size | GSM8k | MATH | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg |
-| -------------- | ---- | ----- | ----- | -------- | ----- | ------ | ----- | ----- | ----- |
-| ChatGPT | - | 78.6% | 38.7% | 67.6% | 77.8% | 79.9% | 81.0% | 89.4% | 73.3% |
-| GPT-4 | - | 94.2% | 51.8% | 77.6% | 94.8% | 95.9% | 92.6% | 97.7% | 86.4% |
-| | | | | | | | | | |
-| OraCoder-Chat | 1B | - | - | - | - | - | - | - | - |
-| OraCoder-Chat | 7B | - | - | - | - | - | - | - | - |
-| OraCoder-Chat | 33B | - | - | - | - | - | - | - | - |