diff --git a/README_zh.md b/README_zh.md
index 5c3e901..18a67dc 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -116,27 +116,27 @@ DeepSeek-R1-Distill 模型基于开源模型微调,使用 DeepSeek-R1 生成
 | | 架构 | - | - | MoE | - | - | MoE |
 | | 激活参数量 | - | - | 37B | - | - | 37B |
 | | 总参数量 | - | - | 671B | - | - | 671B |
-| 英文 | MMLU (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | **91.8** | 90.8 |
-| | MMLU-Redux (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | **92.9** |
-| | MMLU-Pro (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | **84.0** |
-| | DROP (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | **92.2** |
-| | IF-Eval (Prompt Strict) | **86.5** | 84.3 | 86.1 | 84.8 | - | 83.3 |
-| | GPQA-Diamond (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | **75.7** | 71.5 |
-| | SimpleQA (正确率) | 28.4 | 38.2 | 24.9 | 7.0 | **47.0** | 30.1 |
+| 英文 | MMLU 大规模多任务语言理解 (Pass@1) | 88.3 | 87.2 | 88.5 | 85.2 | **91.8** | 90.8 |
+| | MMLU-Redux 大规模多任务语言理解 Redux 集 (EM) | 88.9 | 88.0 | 89.1 | 86.7 | - | **92.9** |
+| | MMLU-Pro 大规模多任务语言理解 Pro 集 (EM) | 78.0 | 72.6 | 75.9 | 80.3 | - | **84.0** |
+| | DROP 段落级离散推理 (3-shot F1) | 88.3 | 83.7 | 91.6 | 83.9 | 90.2 | **92.2** |
+| | IF-Eval 可验证指令遵循评估 (Prompt Strict) | **86.5** | 84.3 | 86.1 | 84.8 | - | 83.3 |
+| | GPQA-Diamond 研究生级防 Google 问答基准 (Pass@1) | 65.0 | 49.9 | 59.1 | 60.0 | **75.7** | 71.5 |
+| | OpenAI SimpleQA 评估 (正确率) | 28.4 | 38.2 | 24.9 | 7.0 | **47.0** | 30.1 |
 | | FRAMES (准确率) | 72.5 | 80.5 | 73.3 | 76.9 | - | **82.5** |
-| | AlpacaEval2.0 (LC胜率) | 52.0 | 51.1 | 70.0 | 57.8 | - | **87.6** |
-| | ArenaHard (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | **92.3** |
-| 代码 | LiveCodeBench (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | **65.9** |
-| | Codeforces (百分位) | 20.3 | 23.6 | 58.7 | 93.4 | **96.6** | 96.3 |
-| | Codeforces (评分) | 717 | 759 | 1134 | 1820 | **2061** | 2029 |
+| | Tatsu Lab 的 AlpacaEval2.0 指令遵循自动评估 (LC胜率) | 52.0 | 51.1 | 70.0 | 57.8 | - | **87.6** |
+| | ArenaHard 基准 (GPT-4-1106) | 85.2 | 80.4 | 85.5 | 92.0 | - | **92.3** |
+| 代码 | LiveCodeBench 编码基准 (Pass@1-COT) | 33.8 | 34.2 | - | 53.8 | 63.4 | **65.9** |
+| | Codeforces 基准 (百分位) | 20.3% | 23.6% | 58.7% | 93.4% | **96.6%** | 96.3% |
+| | Codeforces 基准 (分数) | 717 | 759 | 1134 | 1820 | **2061** | 2029 |
 | | SWE Verified (解决率) | **50.8** | 38.8 | 42.0 | 41.6 | 48.9 | 49.2 |
 | | Aider-Polyglot (准确率) | 45.3 | 16.0 | 49.6 | 32.9 | **61.7** | 53.3 |
-| 数学 | AIME 2024 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | **79.8** |
-| | MATH-500 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | **97.3** |
-| | CNMO 2024 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | **78.8** |
-| 中文 | CLUEWSC (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | **92.8** |
-| | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
-| | C-SimpleQA (正确率) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |
+| 数学 | AIME 2024 美国数学邀请赛 (Pass@1) | 16.0 | 9.3 | 39.2 | 63.6 | 79.2 | **79.8** |
+| | MATH-500 数学问题集 (Pass@1) | 78.3 | 74.6 | 90.2 | 90.0 | 96.4 | **97.3** |
+| | CNMO 2024 中国数学奥林匹克竞赛 (Pass@1) | 13.1 | 10.8 | 43.2 | 67.6 | - | **78.8** |
+| 中文 | CLUEWSC 中文语言理解测评基准 (EM) | 85.4 | 87.9 | 90.9 | 89.9 | - | **92.8** |
+| | C-Eval 中文大模型评估基准 (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
+| | C-SimpleQA 中文事实评价集 (正确率) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |