From 30f7c85c753cb2b71b4b5ec5d5f831f137498a5d Mon Sep 17 00:00:00 2001
From: Benjamin Winkler
Date: Fri, 7 Feb 2025 00:58:05 -0500
Subject: [PATCH] Update README.md to correct grammatical tense

Minor changes to correct grammatical tense for activities that took place
in the past.
---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index d9d4ccf..a7af946 100644
--- a/README.md
+++ b/README.md
@@ -65,7 +65,7 @@ To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSe
 
 **Post-Training: Large-Scale Reinforcement Learning on the Base Model**
 
-- We directly apply reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
+- We directly applied reinforcement learning (RL) to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeek-R1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area.
 
 - We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model's reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models.
 
@@ -109,7 +109,7 @@ For more details regarding the model architecture, please refer to [DeepSeek-V3]
 
 DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1.
 
-We slightly change their configs and tokenizers. Please use our setting to run these models.
+We slightly changed their configs and tokenizers. Please use our setting to run these models.
 
 ## 4. Evaluation Results