First Finish Search: Efficient Test-Time Scaling in Large Language Models
Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty

TL;DR
This paper introduces First Finish Search (FFS), a simple, training-free decoding method that launches several samples in parallel and returns the first one to finish, improving reasoning accuracy in large language models while reducing inference cost.
Contribution
The paper proposes FFS, a novel parallel decoding strategy that leverages early stopping to enhance reasoning performance without additional training or complex heuristics.
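The core idea of the strategy, launching independent samples and returning as soon as any one completes, can be sketched as follows. This is a minimal illustration and not the authors' implementation: `fake_sampler` is a hypothetical stand-in for a real model call, and its fixed delays only simulate traces of different lengths.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical per-sample delays (seconds) standing in for trace length:
# a longer reasoning trace takes longer to finish decoding.
SIMULATED_DELAYS = [0.20, 0.02, 0.10]


async def fake_sampler(prompt: str, index: int) -> str:
    """Hypothetical stand-in for one independent decoding run."""
    await asyncio.sleep(SIMULATED_DELAYS[index % len(SIMULATED_DELAYS)])
    return f"trace {index} for {prompt!r}"


async def first_finish_search(
    sample_fn: Callable[[str, int], Awaitable[str]],
    prompt: str,
    n: int,
) -> str:
    """Launch n independent samples and return the first one to complete."""
    tasks = [asyncio.create_task(sample_fn(prompt, i)) for i in range(n)]
    # Wake up as soon as at least one sample has finished.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Early stopping: cancel every sample that is still decoding.
    for task in pending:
        task.cancel()
    # Let the cancellations settle so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    # If several samples finished at the same moment, any of them is acceptable.
    return next(iter(done)).result()


if __name__ == "__main__":
    print(asyncio.run(first_finish_search(fake_sampler, "2+2?", 3)))
```

With the simulated delays above, the sample at index 1 has the shortest delay and is the one returned. In a real deployment the cancelled samples would also stop consuming decoding compute, which is where the cost saving would come from; that part depends on the serving stack and is not modeled here.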
Findings
With DeepSeek-R1, FFS achieves 82.23% accuracy on the AIME datasets, a 15% improvement over the model's standalone accuracy.
With that result, FFS nearly matches OpenAI's o4-mini performance.
Theoretical analysis explains the effectiveness of early stopping in reasoning tasks.
Abstract
Test-time scaling (TTS), which involves dynamic allocation of compute during inference, offers a promising way to improve reasoning in large language models. While existing TTS methods work well, they often rely on long decoding paths or require a large number of samples to be generated, increasing the token usage and inference latency. We observe the surprising fact that for reasoning tasks, shorter traces are much more likely to be correct than longer ones. Motivated by this, we introduce First Finish Search (FFS), a training-free parallel decoding strategy that launches independent samples and returns as soon as any one completes. We evaluate FFS alongside simple decoding, beam search, majority voting, and budget forcing on four reasoning models (DeepSeek-R1, R1-Distill-Qwen-32B, QwQ-32B and Phi-4-Reasoning-Plus) and across four datasets (AIME24, AIME25-I, AIME25-II and GPQA…
Taxonomy
Methods: Early Stopping
