Scaling Test-Time Compute for Agentic Coding
Joongwon Kim, Wannan Yang, Kelvin Niu, Hongming Zhang, Yun Zhu, Eryk Helenowski, Ruan Silva, Zhengxing Chen, Srinivasan Iyer, Manzil Zaheer, Daniel Fried, Hannaneh Hajishirzi, Sanjeev Arora, Gabriel Synnaeve, Ruslan Salakhutdinov, Anirudh Goyal

TL;DR
This paper introduces a test-time scaling framework for agentic coding that uses compact trajectory summaries to improve large language model performance on long-horizon tasks.
Contribution
It proposes a representation-based approach that pairs compact summaries of rollout trajectories with two inference-time scaling methods, Recursive Tournament Voting and Parallel-Distill-Refine.
Findings
Improves SWE-Bench Verified accuracy from 70.9% to 77.6%.
Enhances Terminal-Bench v2.0 performance from 46.9% to 59.1%.
Demonstrates the importance of representation, selection, and reuse in long-horizon agent scaling.
Abstract
Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of actions, observations, errors, and partial progress taken by the agent. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time…
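To make the idea of a structured rollout summary concrete, here is a minimal Python sketch. It is not the authors' implementation. The abstract names three things a summary preserves (hypotheses, progress, failure modes) but does not give a schema, so the `RolloutSummary` fields are assumptions. The selection routine is likewise only one plausible reading of the name "Recursive Tournament Voting": candidates are split into small groups, a judge picks one winner per group, and the winners compete again until one remains. The `judge` callable and the toy `fewest_failures` heuristic are hypothetical stand-ins for whatever comparison the paper actually uses (most likely a language model).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutSummary:
    """Compact stand-in for one agent rollout (field names are assumptions)."""
    rollout_id: int
    hypotheses: List[str] = field(default_factory=list)
    progress: List[str] = field(default_factory=list)
    failure_modes: List[str] = field(default_factory=list)


# A judge receives a group of summaries and returns the index of the winner.
Judge = Callable[[Sequence[RolloutSummary]], int]


def tournament_select(
    summaries: Sequence[RolloutSummary],
    judge: Judge,
    group_size: int = 2,
) -> RolloutSummary:
    """Pick one summary by repeated group-wise comparison (illustrative only)."""
    if not summaries:
        raise ValueError("need at least one summary")
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    if len(summaries) == 1:
        return summaries[0]

    winners: List[RolloutSummary] = []
    for start in range(0, len(summaries), group_size):
        group = summaries[start:start + group_size]
        # A leftover group of one advances without being judged.
        winners.append(group[judge(group)] if len(group) > 1 else group[0])

    # Each round shrinks the pool, so the recursion terminates.
    return tournament_select(winners, judge, group_size)


def fewest_failures(group: Sequence[RolloutSummary]) -> int:
    """Toy judge (hypothetical): prefer the summary with fewest failure modes."""
    return min(range(len(group)), key=lambda i: len(group[i].failure_modes))


if __name__ == "__main__":
    pool = [
        RolloutSummary(i, failure_modes=["error"] * n)
        for i, n in enumerate([3, 1, 2, 0, 4])
    ]
    best = tournament_select(pool, fewest_failures)
    print(best.rollout_id)
```

The point of the sketch is the shape of the problem the abstract describes: the judge only ever sees short summaries, never full trajectories, so each comparison stays bounded no matter how long the underlying rollouts were.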
