e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
Amrith Setlur, Matthew Y. R. Yang, Charlie Snell, Jeremy Greer, Ian Wu, Virginia Smith, Max Simchowitz, Aviral Kumar

TL;DR
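This paper introduces e3, a training recipe that teaches large language models to explore in context, so that reasoning performance keeps improving at test time beyond the token budget used in training. The recipe combines chaining of operations (generation, verification, refinement), use of negative gradients, and curriculum design.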
Contribution
The paper proposes a novel training recipe e3 that enhances in-context exploration and extrapolation in LLMs, achieving state-of-the-art results on reasoning benchmarks.
Findings
The e3-1.7B model achieves the best reported scores among models under 2B parameters on AIME'25 and HMMT'25.
e3 enables models to extrapolate to twice the training token budget.
e3 improves pass@k performance over the base models.
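For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them are correct. The sketch below shows that standard estimator only; the function name `pass_at_k` is illustrative, and this page does not state which evaluation protocol or sample counts the paper used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # One correct sample out of four: pass@1 = 0.25, pass@2 = 0.5.
    print(pass_at_k(4, 1, 1), pass_at_k(4, 1, 2))
```

Averaging this quantity over all problems in a benchmark gives the reported pass@k.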
Abstract
Test-time scaling offers a promising path to improve LLM reasoning by utilizing more compute at inference time; however, the true promise of this paradigm lies in extrapolation (i.e., improvement in performance on hard problems as LLMs keep "thinking" for longer, beyond the maximum token budget they were trained on). Surprisingly, we find that most existing reasoning models do not extrapolate well. We show that one way to enable extrapolation is by training the LLM to perform in-context exploration: training the LLM to effectively spend its test time budget by chaining operations (such as generation, verification, refinement, etc.), or testing multiple hypotheses before it commits to an answer. To enable in-context exploration, we identify three key ingredients as part of our recipe e3: (1) chaining skills that the base LLM has asymmetric competence in, e.g., chaining verification…
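The idea of spending a budget on "generate, verify, then try again" can be illustrated with a toy Countdown-style puzzle, where checking a candidate is a single comparison but finding one requires search. The sketch below is only an analogy under stated assumptions: `propose`, `verify`, and `explore` are hypothetical helpers, the proposer is a fixed enumeration rather than an LLM, and nothing here reproduces the e3 training recipe.

```python
import itertools
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def verify(value: int, target: int) -> bool:
    """Verification is cheap: one equality check."""
    return value == target


def propose(numbers):
    """Generation is expensive: enumerate left-to-right expressions."""
    for perm in itertools.permutations(numbers):
        for ops in itertools.product(OPS, repeat=len(numbers) - 1):
            value, expr = perm[0], str(perm[0])
            for op, x in zip(ops, perm[1:]):
                value = OPS[op](value, x)
                expr = f"({expr} {op} {x})"
            yield expr, value


def explore(numbers, target, budget):
    """Test up to `budget` candidates; commit only to one that verifies."""
    attempts = 0
    for expr, value in itertools.islice(propose(numbers), budget):
        attempts += 1
        if verify(value, target):
            return expr, attempts
    return None, attempts


if __name__ == "__main__":
    print(explore([2, 3, 4], 11, budget=8))   # budget too small
    print(explore([2, 3, 4], 11, budget=16))  # doubled budget
```

In this toy, the search fails with a budget of 8 candidates and succeeds once the budget is doubled to 16, which mirrors (very loosely) why a policy that keeps testing hypotheses can benefit from extra test-time compute.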
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper tackles an important gap: most open-source “reasoning” LLMs fail to extrapolate to longer test-time compute than they were trained on. 2. It analyzes the verification-generation (VG) asymmetry and argues it is a key driver of RL gains beyond the training length by enabling chains of “generate → verify → revise.” 3. The coupled curriculum is clearly specified (including a concrete budget-selection rule) and empirically outperforms many SOTA models.
1. In Fig. 8(a), Qwen3-1.7B with naïve RL appears to extrapolate similarly to e3 at some budgets; the large headline gains seem to come primarily from better training-length performance (≤16k). The analogous naïve-RL baseline is missing from Fig. 8(b), which makes cross-dataset claims harder to assess. 2. Figures 8(a) and 9 both report AIME’25 accuracy for Qwen3-1.7B/e3 but show different values and ranges; the relationship between the two protocols (SOTA comparison vs. “wait”-prompt budget
- The paper offers a curriculum-based training recipe that chains asymmetric skills, providing a remedy for the failure of many open models to extrapolate beyond their training budget. The empirical results on the math benchmarks are strong, and the Countdown and Multiplication tasks illustrate the VG gap well.
- The ablations and mechanistic analyses are well conducted and support the presented claims, and the theory section also provides good insights. Overall, I believe the pa
- The paper does not directly measure or report how verification capability or the VG gap evolves during training, which I believe is an important direction for better understanding verification-generation training dynamics.
- As an ablation, the paper masks negative gradients in the GRPO Mask baseline. Yet masking negatives also changes the update magnitude and how GRPO’s advantage calculation and clipping behave. So, a more controlled study could have provided better
- The concept of a "chain of asymmetric skills" offers a compelling explanation for the lengthening of reasoning trajectories during RL. Likewise, the "verification-generation gap" (VG gap) clearly accounts for reasoning paths that follow the pattern solution-verification-solution-....
- Using the proposed method e3, the authors trained a model that is state of the art among models under 2.0B parameters on AIME/HMMT 2025.
- The paper presents comprehensive experiments that strongly support the authors' claims.
- The proposed method "e3" appears to lack novelty. The paper does not suggest a clear approach to determine a proper value of $\kappa$, nor does it include additional experiments on the impact of different $\kappa$ values. Also, the budgets it explores (4k, 8k, 16k) are too coarse, leading to suboptimal performance. - The definition of problem hardness is somewhat ambiguous. For example, the authors define the difficulty level for each dataset separately. - The scalability of e3 remains uncerta
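The reviewer concern about the GRPO Mask ablation can be made concrete with a small numeric sketch. It assumes the commonly used GRPO group-normalized advantage, A_i = (r_i - mean(r)) / std(r), and assumes that "masking" means zeroing the negative advantages; the paper's exact baseline is not described on this page, and `group_advantages` and `mask_negative` are illustrative names.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages (population std is an assumption)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mask_negative(advantages):
    """Assumed masking rule: drop the signal from below-average samples."""
    return [a if a > 0 else 0.0 for a in advantages]


def l1(values):
    """Total absolute advantage, a rough proxy for update magnitude."""
    return sum(abs(v) for v in values)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 0.0]  # one correct trace in a group of four
    adv = group_advantages(rewards)
    masked = mask_negative(adv)
    print(adv, masked, l1(masked) / l1(adv))
```

Because normalized advantages sum to zero, the positive and negative parts carry equal total weight, so zeroing the negatives halves the total absolute advantage. Under these assumptions the masked baseline differs from full GRPO in update scale as well as in sign, which is the confound the review points to.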
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
Methods: Balanced Selection
