Update README.md

2025-07-10 11:18:58 -04:00 · 2023-11-01 10:48:55 +08:00 · eaa34f787c
commit eaa34f787c
parent e0bbc1b808
1 changed file with 5 additions and 27 deletions
--- a/Evaluation/PAL-Math/README.md
+++ b/Evaluation/PAL-Math/README.md
@@ -17,7 +17,7 @@ pip install sympy==1.12 pebble timeout-decorator transformers
 We provide an example of testing the **deepseek-coder-1b-python** model on the **gsm8k** dataset using **8** GPUs. If you wish to use a different model or dataset, you can modify it as needed.

 ```bash
-MODEL_NAME_OR_PATH=deepseek/deepseek-coder-1b-python
+MODEL_NAME_OR_PATH=deepseek-ai/deepseek-coder-7b-base
 DATA=gsm8k # 'math' 'gsm8k' 'gsm-hard' 'svamp' 'tabmwp' 'asdiv' 'mawps'
 MODEL_DIR_NAME=${MODEL_NAME_OR_PATH##*/}
 GPU_NUM=8
@@ -45,7 +45,6 @@ We report experimental results here for mathematical reasoning tasks by using py



-#### (1) Multilingual Base Models

 | Model          | Size | GSM8k | MATH  | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg   |
 | -------------- | ---- | ----- | ----- | -------- | ----- | ------ | ----- | ----- | ----- |
@@ -56,30 +55,9 @@ We report experimental results here for mathematical reasoning tasks by using py
 | CodeLLama-Base | 13B  | 44.2% | 15.5% | 42.4%    | 65.6% | 61.6%  | 65.3% | 85.3% | 54.3% |
 | CodeLLama-Base | 34B  | 58.2% | 22.1% | 55.2%    | 77.2% | 69.6%  | 70.0% | 92.8% | 63.6% |
 |                |      |       |       |          |       |        |       |       |       |
-| OraCoder-Base  | 1B   | 17.0% | 13.4% | 13.3%    | 39.2% | 42.4%  | 44.8% | 66.0% | 33.7% |
-| OraCoder-Base  | 7B   | 46.0% | 20.6% | 40.0%    | 67.2% | 71.2%  | 67.1% | 89.1% | 57.3% |
-| OraCoder-Base  | 33B  | -     | -     | -        | -     | -      | -     | -     | -     |
+| DeepSeek-Coder-Base  | 1.3B   | 15.8% | 16.3% | 14.5%    | 38.4% | 28.8%  | 51.3% | 66.0% | 33.0% |
+| DeepSeek-Coder-MQA-Base  | 5.7B   | 44.8% | 25.4% | 40.6%    | 56.8% | 62.4%  | 66.8% | 84.2% | 54.4% |
+| DeepSeek-Coder-Base  | 6.7B   | 46.1% | 25.6% | 40.0%    | 67.2% | 71.2%  | 69.0% | 89.2% | 58.3% |
+| DeepSeek-Coder-Base  | 33B  | 58.2% | 35.3% | 54.5%    | 78.4% | 76.8%  | 78.2% | 94.0% | 67.9% |

-#### (2) Python Base Models
-
-| Model          | Size | GSM8k | MATH  | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg   |
-| -------------- | ---- | ----- | ----- | -------- | ----- | ------ | ----- | ----- | ----- |
-| StarCoder          | 16B  | 31.5% | 13.8% | 26.7%    | 48.8% | 47.2%  | 54.9% | 76.1% | 42.7% |
-| CodeLLama-Python   | 7B   | 35.2% | 14.7% | 34.5%    | 70.4% | 55.2%  | 62.1% | 84.2% | 50.9% |
-| CodeLLama-Python   | 13B  | 44.8% | 17.4% | 45.5%    | 65.6% | 60.8%  | 69.0% | 89.6% | 56.1% |
-| CodeLLama-Python   | 34B  | 57.6% | 21.1% | 54.5%    | 76.8% | 66.8%  | 69.5% | 94.2% | 62.9% |
-|  |  |  |  |  |  |  |  |  |  |
-| OraCoder-Python    | 1B   | 17.6% | 15.0% | 18.2%    | 40.0% | 38.4%  | 49.5% | 64.1% | 34.7% |
-| OraCoder-Python    | 7B   | 50.3% | 24.3% | 43.0%    | 71.2% | 73.6%  | 69.7% | 88.0% | 60.0% |
-| OraCoder-Python    | 33B  | -     | -     | -        | -     | -      | -     | -     | -     |
-
-#### (3) Instruction-Tuned Models
-| Model          | Size | GSM8k | MATH  | GSM-Hard | SVAMP | TabMWP | ASDiv | MAWPS | Avg   |
-| -------------- | ---- | ----- | ----- | -------- | ----- | ------ | ----- | ----- | ----- |
-| ChatGPT            | -    | 78.6% | 38.7% | 67.6%    | 77.8% | 79.9%  | 81.0% | 89.4% | 73.3% |
-| GPT-4              | -    | 94.2% | 51.8% | 77.6%    | 94.8% | 95.9%  | 92.6% | 97.7% | 86.4% |
-|               |      |       |       |          |       |        |       |       |       |
-| OraCoder-Chat      | 1B   | -     | -     | -        | -     | -      | -     | -     | -     |
-| OraCoder-Chat      | 7B   | -     | -     | -        | -     | -      | -     | -     | -     |
-| OraCoder-Chat      | 33B  | -     | -     | -        | -     | -      | -     | -     | -     |
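
Review note: the `Avg` values in the four added DeepSeek-Coder rows appear consistent with an unweighted mean of the seven benchmark columns, rounded to one decimal. The README does not state how `Avg` is computed, so this is an assumption. A minimal Python sketch to re-check the added rows (the names `ROWS`, `mean_pct`, and `check_rows` are illustrative and not part of the repository):

```python
# Scores copied from the added rows of this diff, in column order:
# GSM8k, MATH, GSM-Hard, SVAMP, TabMWP, ASDiv, MAWPS, followed by the reported Avg.
ROWS = {
    "DeepSeek-Coder-Base 1.3B": ([15.8, 16.3, 14.5, 38.4, 28.8, 51.3, 66.0], 33.0),
    "DeepSeek-Coder-MQA-Base 5.7B": ([44.8, 25.4, 40.6, 56.8, 62.4, 66.8, 84.2], 54.4),
    "DeepSeek-Coder-Base 6.7B": ([46.1, 25.6, 40.0, 67.2, 71.2, 69.0, 89.2], 58.3),
    "DeepSeek-Coder-Base 33B": ([58.2, 35.3, 54.5, 78.4, 76.8, 78.2, 94.0], 67.9),
}


def mean_pct(scores):
    """Unweighted mean of the per-benchmark scores (assumed definition of Avg)."""
    return sum(scores) / len(scores)


def check_rows(rows, tolerance=0.05):
    """Map each row name to True if the reported Avg matches the mean within rounding."""
    return {
        name: abs(mean_pct(scores) - reported) < tolerance
        for name, (scores, reported) in rows.items()
    }


if __name__ == "__main__":
    for name, ok in check_rows(ROWS).items():
        print(f"{name}: {'consistent' if ok else 'MISMATCH'}")
```

The tolerance of 0.05 reflects the one-decimal rounding in the table; a mismatch would suggest either a transcription error in a row or a different averaging scheme.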