Abstract

DeepSeek-R1 [DeepSeek-AI, 2025] (arXiv:2501.12948) makes a striking claim: reinforcement learning alone, without supervised fine-tuning on chain-of-thought demonstrations, can elicit sophisticated reasoning in large language models. The paper introduces DeepSeek-R1-Zero, trained via Group Relative Policy Optimization (GRPO) directly on a base model, and DeepSeek-R1, which layers on cold-start SFT and a multi-stage training pipeline. Both models report performance competitive with OpenAI's o1 across mathematical reasoning, coding, and general benchmarks. The central finding, that emergent self-verification and reflection arise from outcome-based RL alone, would represent a genuine shift in how we train reasoning systems. Yet a careful audit of the experimental methodology reveals significant gaps: key benchmarks have sample sizes too small for reliable comparison, baselines are not controlled for compute or data, critical ablations are missing, and no results include variance estimates. The contribution is conceptually important. Whether the empirical evidence fully supports the claims as stated is a different question entirely.

Three Contributions, Three Different Levels of Novelty

The paper makes three distinct contributions that deserve separate evaluation.

First, DeepSeek-R1-Zero demonstrates that applying GRPO directly to a base model (DeepSeek-V3-Base, a 671B mixture-of-experts architecture) produces outputs exhibiting self-verification and extended reasoning chains. This is the most novel claim. It challenges the orthodoxy established by Ouyang et al. [2022] that SFT is a necessary precondition for effective RL.

Second, DeepSeek-R1 itself is an engineering contribution: a four-stage training pipeline (cold-start SFT, reasoning-focused RL, rejection sampling with SFT, full-scenario RL) that yields a model competitive with o1 on multiple benchmarks.

Third, the paper shows that distilling R1's reasoning outputs into smaller models (1.5B to 70B parameters, based on Qwen and Llama architectures) transfers reasoning capability effectively, often outperforming those same smaller models trained with RL directly.

The first contribution qualifies as a significant empirical finding. The second amounts to engineering refinement. The third is useful but expected given prior distillation literature. Novelty rating: moderate to significant, weighted almost entirely by the R1-Zero finding.

How GRPO Works, and What It Trades Away

The core algorithmic component is GRPO, introduced in DeepSeek-Math [Shao et al. 2024]. For each prompt $q$, GRPO samples a group of $G$ completions $\{o_1, \dots, o_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$, scores them with a reward function, and computes group-normalized advantages:

$$A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$

where $\operatorname{mean}(\cdot)$ and $\operatorname{std}(\cdot)$ are the mean and standard deviation of rewards within the sampled group. The policy is then updated using a clipped surrogate objective with a KL penalty (shown here at the sequence level for brevity):

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\big(\rho_i A_i,\ \operatorname{clip}(\rho_i,\ 1-\epsilon,\ 1+\epsilon)\, A_i\big)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$

where $\rho_i = \pi_\theta(o_i \mid q) \,/\, \pi_{\theta_{\text{old}}}(o_i \mid q)$ is the importance ratio.

The key advantage over standard PPO [Schulman et al. 2017] is the elimination of the critic network. At 671B parameters, this is no minor convenience: it represents substantial memory savings at training time. But the trade-off is real: group-normalized advantages are noisier than learned value baselines, particularly when $G$ is small or the reward distribution is multimodal within a group. The paper reports neither the group size used in final training nor any sensitivity analysis of this hyperparameter.
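The mechanics can be sketched in a few lines. This is a schematic of the group-normalized advantage and the clipped surrogate term, not DeepSeek's implementation; the toy binary rewards and the omission of the KL penalty are simplifications:

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages as in GRPO: (r_i - mean) / std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(ratio, advantage, eps_clip=0.2):
    """PPO-style clipped term for one completion (KL penalty omitted)."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1 + eps_clip), 1 - eps_clip) * advantage
    return min(unclipped, clipped)

# Toy group: binary rewards for G = 8 sampled completions.
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
advs = group_advantages(rewards)
# Correct completions receive positive advantage, incorrect negative,
# with no learned value network required to form the baseline.
```

Note how the baseline is purely statistical: the group mean stands in for the critic's value estimate, which is exactly where the extra variance enters when $G$ is small.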

The reward signal for R1-Zero is binary correctness for math (verified against ground truth) and execution-based for code. A small format reward encourages the model to place its final answer within designated tags. No process reward model is used, only outcome-based reward. This matters: the model must discover useful intermediate reasoning steps entirely through credit assignment from terminal rewards, a notoriously difficult problem in RL [Sutton and Barto, 2018].
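A toy verifier illustrates the outcome-plus-format structure. The tag names mirror the paper's think/answer template, but the 0.1 format bonus and exact-match check are illustrative assumptions, not the paper's exact scheme:

```python
import re

def outcome_reward(completion: str, ground_truth: str) -> float:
    """Binary outcome reward: 1.0 iff the tagged final answer matches."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def format_reward(completion: str) -> float:
    """Small bonus for using the designated think/answer tags.
    (The 0.1 weight is an assumption for illustration.)"""
    has_think = "<think>" in completion and "</think>" in completion
    has_answer = "<answer>" in completion and "</answer>" in completion
    return 0.1 if (has_think and has_answer) else 0.0

completion = "<think>2 + 2 = 4</think><answer>4</answer>"
total = outcome_reward(completion, "4") + format_reward(completion)
# total == 1.1 for a correct, well-formatted completion
```

Everything between the think tags is invisible to this reward: the intermediate reasoning is rewarded only insofar as it changes the terminal answer, which is the credit assignment difficulty noted above.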

Claims vs. Evidence: A Careful Accounting

The paper's central claims deserve individual scrutiny against the evidence provided.

Claim 1: Pure RL induces emergent reasoning without SFT. The evidence rests on the R1-Zero experiment. The model does produce longer chain-of-thought outputs and exhibits patterns resembling self-verification. But "emergent" is doing heavy lifting here. DeepSeek-V3-Base was pretrained on a massive corpus that almost certainly contains chain-of-thought-style reasoning: textbooks, worked solutions, forum posts with step-by-step derivations. The RL signal doesn't create reasoning from nothing; it selects for reasoning patterns already latent in the base model's distribution. The authors implicitly assume that the base model carries no reasoning priors, an assumption that crumbles under scrutiny of modern pretraining corpora. A cleaner test would use a base model pretrained only on text devoid of step-by-step solutions. This is impractical at scale, but the assumption should be stated explicitly.

Claim 2: R1 matches o1 on key benchmarks. The reported numbers are competitive, but the comparison is uncontrolled. Different base models, different pretraining data, different compute budgets, different evaluation protocols. The claimed margins are narrow enough that protocol differences alone could explain them. More on this below.

Claim 3: Distillation from R1 to smaller models outperforms direct RL on small models. This is the best-supported claim, with controlled comparisons between distilled and RL-trained versions of identical base architectures. The improvement from distillation is consistent across model sizes.

The Baseline Problem

The devil lives in the evaluation protocol, and here the baselines raise several concerns.

The primary comparison target is OpenAI's o1. But the authors had no access to o1's training data, architecture details, or compute budget. This reduces the R1 vs. o1 comparison to a measurement of final outputs, not of methods. To evaluate whether GRPO drives R1's gains, we need comparisons that hold other variables constant. A fairer baseline set would include: (a) the same base model with standard SFT-then-PPO [Ouyang et al. 2022], (b) the same base model with SFT only and no RL phase, and (c) the same base model with alternative RL algorithms such as REINFORCE or vanilla PPO. The paper provides some of these comparisons for R1-Zero but not for the full R1 pipeline.

At least one baseline was improperly chosen. The paper compares distilled small models against QwQ-32B-Preview as a reasoning baseline, but QwQ was a preview release, not tuned to compete on identical benchmarks under comparable compute. Notable omissions include STaR-style iterative self-training [Zelikman et al. 2022], which achieves reasoning improvement through supervised learning on self-generated correct traces without any RL, and process reward model approaches [Lightman et al. 2023], which offer an alternative credit assignment mechanism.

Five Missing Ablations That Would Change the Story

The absent ablation that would most strengthen this paper is a systematic decomposition of the four-stage R1 pipeline. Which stage contributes what?

1. Cold-start SFT alone vs. full pipeline. How much of R1's final performance comes from the initial thousands of curated CoT examples? If cold-start SFT already captures 80% of the gain, the narrative shifts considerably.

2. Stage 2 alone vs. stages 2+3+4. Does rejection sampling followed by a second SFT round provide genuine reasoning improvement, or does it primarily smooth formatting and readability?

3. Group size $G$ sensitivity. The variance of the group-mean baseline scales as $O(1/G)$, so small groups yield noisy gradients. What is the minimum group size for stable training at this scale?

4. KL coefficient $\beta$ sensitivity. The paper mentions that R1-Zero exhibits language mixing when the KL constraint is too weak but provides no systematic study of the trade-off curve.

5. Reward function ablation. What happens with process-based rewards [Uesato et al. 2022] instead of outcome-based? How much does the format reward contribute independently?

None of these ablations appear for the full R1 pipeline.
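The stakes of ablation 3 can be quantified analytically for binary rewards: if all $G$ completions in a group receive the same reward, the group standard deviation is zero, so every advantage in the group vanishes and the prompt contributes no gradient. Assuming independent correctness with probability p (a simplifying assumption; real completions are correlated):

```python
def degenerate_group_prob(p: float, G: int) -> float:
    """P(all G binary rewards identical) = p^G + (1-p)^G.
    Such groups have zero reward std, hence zero advantage
    for every completion under group normalization."""
    return p ** G + (1 - p) ** G

for G in (4, 8, 16, 64):
    print(G, round(degenerate_group_prob(0.3, G), 4))
# At p = 0.3, roughly a quarter of groups carry no gradient at G = 4,
# versus a negligible fraction at G = 64.
```

This is precisely the kind of curve a group-size ablation would pin down empirically, including the correlated-completion regime the closed form ignores.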

Where Are the Error Bars?

This is perhaps the most consequential methodological concern. There are no variance estimates anywhere in the paper.

AIME 2024 contains 30 problems. At 79.8% accuracy, the binomial 95% confidence interval spans approximately [62%, 92%]. For o1 at 79.2%, the interval is roughly [61%, 91%]. These intervals overlap almost entirely.

| Benchmark  | R1 (reported) | o1 (reported) | n      | Approx. 95% CI width |
|------------|---------------|---------------|--------|----------------------|
| AIME 2024  | 79.8%         | 79.2%         | 30     | ±15%                 |
| MATH-500   | 97.3%         | 96.4%         | 500    | ±1.5%                |
| Codeforces | 96.3 pctl     | 96.6 pctl     | varies | unreported           |

MATH-500 is the only benchmark with a sample size large enough for meaningful comparison, and even there the difference (0.9 percentage points) may not clear the noise threshold. The paper reports pass@1 numbers without specifying how many evaluation runs were conducted. For stochastic decoding, pass@1 varies across runs. Without the number of seeds and standard deviations, these are point estimates of unknown reliability.
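These brackets are straightforward to reproduce. A minimal Wilson score computation (slightly narrower than an exact binomial interval, which is presumably what the figures quoted above use) makes the sample-size gap concrete:

```python
import math

def wilson_ci(p_hat: float, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    return center - margin, center + margin

lo, hi = wilson_ci(0.798, 30)        # AIME 2024: ~24/30 correct
# Roughly (0.62, 0.90): the interval spans almost 30 points.
lo500, hi500 = wilson_ci(0.973, 500)  # MATH-500: under 3 points wide
```

The 30-problem interval is wide enough to contain both models' point estimates comfortably, which is the entire argument of this section in three lines of arithmetic.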

The R1-Zero result on AIME is more dramatic: improvement from approximately 15.6% to 71.0% pass@1. A 55-point swing is statistically significant even on 30 problems. But the consensus@64 result of 86.7% conflates the RL-trained policy's quality with the benefit of majority voting, a technique that improves any model [Wang et al. 2023]. Disentangling these two contributions is essential.
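The conflation is easy to demonstrate with a toy simulation: a model of fixed quality gains substantially from majority voting, so consensus@64 gains cannot be attributed to RL training. The independence of samples and the uniform scatter of wrong answers over a handful of distractors are both assumptions of the toy model, not claims about R1-Zero:

```python
import random
from collections import Counter

def consensus_accuracy(p: float, k: int, distractors: int = 8,
                       trials: int = 5000, seed: int = 0) -> float:
    """Monte Carlo estimate of majority-vote accuracy over k samples,
    where each sample is right w.p. p and wrong answers scatter
    uniformly over a few distractor values."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        votes = Counter(
            "right" if rng.random() < p
            else f"wrong{rng.randrange(distractors)}"
            for _ in range(k)
        )
        if votes.most_common(1)[0][0] == "right":
            correct += 1
    return correct / trials

# consensus_accuracy(0.71, 1) recovers pass@1, while
# consensus_accuracy(0.71, 64) approaches 1.0 with no change
# to the underlying policy at all.
```

Reporting the consensus@64 number next to competitors' pass@1 numbers, without this decomposition, flatters the RL contribution.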

Data Contamination, Cherry-Picked Examples, and Narrow Domains

Three evaluation issues warrant close scrutiny.

Data contamination. AIME 2024 problems were publicly available before DeepSeek-V3's pretraining data cutoff. The paper describes no decontamination procedures for the base model's pretraining corpus. If AIME 2024 problems, or close variants, appeared in pretraining, the R1-Zero results overestimate RL's contribution. This concern extends to MATH-500, which draws from widely circulated competition problem sets [Hendrycks et al. 2021].

Qualitative evidence masquerading as quantitative. The paper highlights R1-Zero's self-verification and reflection behaviors, but these are assessed through selected examples rather than measured systematically across the test set. How often does the model self-verify? When it does, does accuracy improve? The celebrated "aha moment" is a single cherry-picked instance. Without systematic behavioral analysis, it remains anecdotal.

Narrow domain coverage. The evaluation concentrates on math and code, where outcome verification is clean and binary. Generalization to domains with noisy or subjective reward signals (open-ended reasoning, ethical dilemmas, multi-step planning under uncertainty) is untested but implicitly claimed.

How This Fits the Literature

The pure-RL-for-reasoning agenda connects to several research threads worth distinguishing carefully.

Zelikman et al. [2022] proposed STaR, bootstrapping reasoning by training on the model's own correct chain-of-thought traces. This is conceptually adjacent to R1-Zero but uses supervised learning on filtered self-generated data rather than RL. The comparison matters: STaR achieves reasoning improvement without RL at all, raising the question of whether GRPO is the active ingredient or whether any selection mechanism favoring correct reasoning suffices.
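The structural contrast is worth spelling out: one STaR round is just sample, filter on correctness, and fine-tune on survivors. The sketch below is schematic, with `generate`, `is_correct`, and `finetune` as hypothetical stand-ins for model sampling, answer checking, and supervised training (the trivial stand-ins exist only so the loop runs end-to-end):

```python
def star_iteration(model, prompts, answers, generate, is_correct, finetune):
    """One STaR round: sample traces, keep only correct ones, SFT on them.
    The selection pressure resembles GRPO's, but the update is pure
    supervised learning; no policy gradient is involved."""
    kept = []
    for prompt, gold in zip(prompts, answers):
        trace = generate(model, prompt)   # sample a reasoning trace
        if is_correct(trace, gold):       # outcome filter, no RL
            kept.append((prompt, trace))
    return finetune(model, kept)          # supervised update only

# Trivial stand-ins (all hypothetical) so the loop is runnable:
model = {"data": []}
generate = lambda m, p: p.upper()         # "trace" = uppercased prompt
is_correct = lambda trace, gold: trace == gold
def finetune(m, pairs):
    m["data"].extend(pairs)
    return m

model = star_iteration(model, ["ab", "cd"], ["AB", "xx"],
                       generate, is_correct, finetune)
# Only the correct trace ("ab" -> "AB") survives the filter.
```

Both methods reduce to "upweight trajectories that reached the right answer"; the open question the paper leaves unanswered is whether the policy-gradient machinery adds anything over this simpler loop at equal compute.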

Lightman et al. [2023] demonstrated that process reward models, rewarding intermediate steps rather than only final answers, significantly improve mathematical reasoning. R1 deliberately avoids process rewards, relying solely on outcome-based feedback. Whether this design choice is principled or merely convenient remains an open question the paper does not address.

The broader precedent is AlphaGo Zero [Silver et al. 2017], which demonstrated that RL without human demonstrations can exceed human performance. The analogy is compelling but the domain gap is vast: Go has a perfect simulator and unambiguous outcomes. Language reasoning operates in a space of subjective rewards with no ground-truth verifier for most tasks outside mathematics and competitive programming.

What It Means If the Core Claim Holds

If the central finding withstands replication, the implications are substantial. It would mean the bottleneck for reasoning capability shifts from curated human demonstrations to compute for RL training and reward signal quality. This favors well-resourced labs and could accelerate capability development beyond current scaling projections.

The distillation results compound this concern. If strong reasoning transfers reliably to 7B-parameter models, restricting access to reasoning-capable systems becomes far harder. The paper's release of distilled model weights is laudable for reproducibility but creates an irreversible distribution of capable reasoning models.

Why No One Can Reproduce This

Could an independent team replicate these results? Almost certainly not.

The paper does not specify the exact composition of cold-start SFT data (described as "thousands" of examples with vague selection criteria), the group size in GRPO, the learning rate schedule and batch size for RL training, the number of RL training steps, or the rejection sampling threshold used in stage 3. The compute cost for training R1 on DeepSeek-V3-Base goes unreported but is likely substantial, plausibly on the order of millions of GPU-hours.

Reproducibility scores:

  • Code availability: 2/5. Inference weights released, but no training code or pipeline scripts. Reproducing the RL training is infeasible without the full infrastructure.
  • Data availability: 2/5. Evaluation benchmarks are public. Cold-start SFT data and the full RL training data distribution are proprietary and unspecified.
  • Experimental detail: 2/5. Key hyperparameters incompletely specified. Compute requirements unreported. Evaluation protocol details (number of seeds, temperature, sampling strategy for pass@1) insufficient for exact replication.

Five Questions the Authors Should Answer

1. What group size was used in GRPO, and how sensitive are the R1-Zero results to this parameter?

2. Can pass@1 results be reported as means over at least 5 random seeds with standard deviations, particularly for AIME 2024?

3. What decontamination procedure was applied to ensure AIME 2024 and MATH benchmark problems were absent from the pretraining corpus?

4. How does R1-Zero compare to STaR-style iterative self-training [Zelikman et al. 2022] applied to the same base model with the same compute budget?

5. What fraction of R1-Zero's outputs across the full test set exhibit the self-verification behavior highlighted in the qualitative examples?

Verdict: A Compelling Idea Undermined by Its Own Evidence

The paper presents a conceptually important finding wrapped in experimental methodology that does not meet the standard of evidence its claims require. The qualitative insight, that pure RL can surface reasoning-like behaviors from a strong base model, is valuable and likely directionally correct. The R1-Zero training curve showing progressive development of longer, more structured reasoning chains is genuinely compelling as a demonstration.

But the specific quantitative claims of matching o1 are not reliably established. Sample sizes on flagship benchmarks are too small, no variance estimates appear anywhere, baselines are uncontrolled for compute or data, and critical ablations are absent. The four-stage R1 pipeline is presented as a monolith with no decomposition of per-stage contribution.

Strength of empirical evidence: moderate. Strong for the qualitative finding that RL induces reasoning-like behavior. Weak for the specific numerical claims of competitive performance with o1.

Researchers building on this work should treat the qualitative mechanistic insight as the primary contribution and approach the benchmark tables with appropriate skepticism. For the field, the prescription is straightforward: report confidence intervals, run multiple seeds, control your baselines, and decompose your pipeline. Reproducibility is not optional; it is the minimum.

Reproducibility & Sources

Primary paper: DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, January 2025.

Code repository: No official training code released. Model weights (R1 and distilled variants) available via Hugging Face. Inference code provided.

Datasets:

  • AIME 2024: 30-problem competition set, publicly available
  • MATH-500: Subset of MATH benchmark [Hendrycks et al. 2021], publicly available
  • Codeforces: Public competitive programming platform
  • Cold-start SFT data: Proprietary, not released
