e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs

Amrith Setlur; Matthew Y. R. Yang; Charlie Snell; Jeremy Greer; Ian Wu; Virginia Smith; Max Simchowitz; Aviral Kumar

arXiv:2506.09026·cs.LG·June 16, 2025

TL;DR

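This paper introduces e3, a training recipe that teaches large language models to explore in context so that their reasoning performance keeps improving at test time beyond the training token budget. The recipe combines chaining operations such as generation and verification, leveraging negative gradients during RL training, and a coupled curriculum.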

Contribution

The paper proposes e3, a training recipe that enhances in-context exploration and extrapolation in LLMs and yields a 1.7B model that is state of the art among models under 2B parameters on AIME'25 and HMMT'25.

Findings

1. The e3-1.7B model achieves the best reported scores among models under 2B parameters on AIME'25 and HMMT'25.
2. e3 enables models to extrapolate to twice the training token budget.
3. e3 improves pass@k performance over the base models.
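
For readers unfamiliar with the pass@k metric named in the findings, the sketch below shows the widely used unbiased estimator: with `n` sampled solutions per problem, of which `c` are correct, it computes the probability that at least one of `k` randomly drawn samples is correct. The function name `pass_at_k` is illustrative, and this page does not state whether the paper uses exactly this estimator.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per problem, c: number of correct samples,
    k: number of attempts the metric allows (k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Usage: 16 samples per problem, 2 of them correct.
for k in (1, 4, 16):
    print(k, pass_at_k(16, 2, k))
```

Averaging this quantity over all problems in a benchmark gives the benchmark-level pass@k.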

Abstract

Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification…
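
The idea of chaining a weak generation skill with a strong verification skill can be illustrated on a toy Countdown-style puzzle. This is a minimal sketch and not the paper's method: the random proposer stands in for an LLM's generation step, the exact arithmetic check stands in for its verification step, and the names `generate`, `verify`, and `explore` as well as the left-to-right evaluation rule are assumptions made for illustration.

```python
import random
from fractions import Fraction

# Binary operators; division by zero yields None so the verifier can reject it.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else None,
}


def verify(numbers, ops, target):
    """Cheap, exact check: evaluate left to right and compare with the target."""
    value = Fraction(numbers[0])
    for op, n in zip(ops, numbers[1:]):
        value = OPS[op](value, Fraction(n))
        if value is None:
            return False
    return value == target


def generate(numbers, rng):
    """Weak proposer (hypothetical stand-in for generation): random order, random ops."""
    order = list(numbers)
    rng.shuffle(order)
    ops = [rng.choice("+-*/") for _ in order[1:]]
    return order, ops


def explore(numbers, target, budget, seed=0):
    """Chain generate -> verify until a proposal passes or the budget runs out."""
    rng = random.Random(seed)
    for attempt in range(1, budget + 1):
        order, ops = generate(numbers, rng)
        if verify(order, ops, target):
            return attempt, order, ops
    return None  # budget exhausted without a verified answer


# Usage: a larger budget gives the chain more chances to find a verified answer.
print(explore([3, 5, 7], 22, budget=5))
print(explore([3, 5, 7], 22, budget=2000))
```

Because checking a candidate is far easier than producing a correct one, spending more attempts raises the chance of success. That asymmetry is what makes a longer chain worthwhile in this toy setting.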

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper tackles an important gap: most open-source “reasoning” LLMs fail to extrapolate to longer test-time compute than they were trained on.
2. It analyzes the verification-generation (VG) asymmetry and argues it is a key driver of RL gains beyond the training length by enabling chains of “generate → verify → revise.”
3. The coupled curriculum is clearly specified (including a concrete budget-selection rule) and empirically outperforms many SOTA models.

Weaknesses

1. In Fig. 8(a), Qwen3-1.7B with naïve RL appears to extrapolate similarly to e3 at some budgets; the large headline gains seem to come primarily from better training-length performance (≤16k). The analogous naïve-RL baseline is missing from Fig. 8(b), which makes cross-dataset claims harder to assess.
2. Figures 8(a) and 9 both report AIME’25 accuracy for Qwen3-1.7B/e3 but show different values and ranges; the relationship between the two protocols (SOTA comparison vs. “wait”-prompt budget

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The paper offers a curriculum-based training recipe that chains asymmetries, providing a remedy to the problem that many open models fail to extrapolate beyond their training budget. The empirical results are strong on the math benchmarks, and the choice of the Countdown and Multiplication tasks is illustrative in depicting the VG gap.
- The ablations and mechanistic analyses are well conducted and support the presented claims, and the theory section also provides good insights. Overall, I believe the pa

Weaknesses

- The paper does not directly measure or report how verification capability or the VG gap evolves during training, which I believe is an important direction to investigate to better understand the verification‑generation training dynamics.
- As an ablation study, the paper demonstrates masking negative gradients in the GRPO Mask baseline. Yet masking negatives also changes the update magnitude and how GRPO’s advantage calculation and clipping behave. So, a more controlled study could have provided better
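
The reviewer's point about update magnitude can be seen in a small numeric sketch. It assumes the standard group-normalized GRPO advantage (reward minus group mean, divided by group standard deviation) and assumes that "masking negatives" means zeroing the advantages of below-average samples; how the paper's GRPO Mask baseline is actually implemented is not stated here, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mask_negatives(advantages):
    """Assumed masking rule: drop the signal from samples with negative advantage."""
    return [a if a > 0 else 0.0 for a in advantages]


# One group of four rollouts with binary rewards: a single correct sample.
rewards = [1, 0, 0, 0]
adv = grpo_advantages(rewards)
masked = mask_negatives(adv)

# Total absolute advantage is a rough proxy for the size of the policy update.
print("full:  ", adv, sum(abs(a) for a in adv))
print("masked:", masked, sum(abs(a) for a in masked))
```

Without masking, the advantages in a group sum to roughly zero. With masking, only the positive part survives, so the total signal shrinks and is no longer zero-mean. This is why the ablation changes more than the sign of the gradient.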

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The concept of "chain of asymmetric skills" offers a compelling explanation for the reasoning trajectory extension during RL. Likewise, the "verification-generation gap" (VG Gap) clearly accounts for the reasoning path that appears in the pattern of solution-verification-solution-....
- The authors trained a state-of-the-art (<2.0B) model on AIME/HMMT 2025 using the suggested method e3.
- The paper presents comprehensive experiments that strongly support the authors' claims.

Weaknesses

- The proposed method "e3" appears to lack novelty. The paper does not suggest a clear approach to determine a proper value of $\kappa$, nor does it include additional experiments on the impact of different $\kappa$ values. Also, the budgets it explores (4k, 8k, 16k) are too coarse, leading to suboptimal performance.
- The definition of problem hardness is somewhat ambiguous. For example, the authors define the difficulty level for each dataset separately.
- The scalability of e3 remains uncerta

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)

Methods: Balanced Selection