Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

Junhong Shen; Hao Bai; Lunjun Zhang; Yifei Zhou; Amrith Setlur; Shengbang Tong; Diego Caples; Nan Jiang; Tong Zhang; Ameet Talwalkar; Aviral Kumar

arXiv:2506.07976·cs.LG·June 11, 2025

TL;DR

This paper introduces Test-Time Interaction (TTI), an approach that improves agent performance by lengthening the agent's interaction horizon, which enables more adaptive and exploratory behavior during task execution. The approach is demonstrated on web agent benchmarks.

Contribution

The paper proposes TTI, a curriculum-based online RL method that adaptively scales test-time interaction, significantly improving web agent performance and enabling dynamic behavior adaptation.
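To make the idea of a curriculum over interaction length concrete, here is a minimal sketch in Python. It is not the paper's schedule: the function name `horizon_schedule`, the multiplicative growth rule, and all default values (`h_min`, `h_max`, `growth`, `every`) are assumptions chosen for illustration. It only shows how a cap on rollout length could start small and widen as training proceeds.

```python
def horizon_schedule(iteration: int,
                     h_min: int = 10,
                     h_max: int = 30,
                     growth: float = 2.0,
                     every: int = 5) -> int:
    """Hypothetical curriculum: return the maximum number of environment
    steps allowed per rollout at a given training iteration.

    The cap starts at h_min, is multiplied by `growth` once every `every`
    iterations, and never exceeds h_max. All values are illustrative.
    """
    stage = iteration // every                    # how many times the cap has grown
    return min(h_max, int(h_min * growth ** stage))


# Print the cap at a few training iterations.
for it in (0, 4, 5, 10, 50):
    print(it, horizon_schedule(it))
```

In an online RL loop, a value like this would be passed as the step limit when collecting rollouts, so that early training uses short episodes and later training permits longer ones.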

Findings

01

TTI produces state-of-the-art open-source, open-data web agents on the WebVoyager and WebArena benchmarks.

02

Prompting-based interaction scaling improves task success on web benchmarks even without any training.

03

TTI enables agents to balance exploration and exploitation adaptively.

Abstract

The current paradigm of test-time scaling relies on generating long reasoning traces ("thinking" more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in the world. However, this process does not allow agents to acquire new information from the environment or adapt their behavior over time. In this work, we propose to scale test-time interaction, an untapped dimension of test-time scaling that increases the agent's interaction horizon to enable running rich behaviors such as exploration, backtracking, and dynamic re-planning within a single rollout. To demonstrate the promise of this scaling dimension, we study the domain of web agents. We first show that even prompting-based interaction scaling without any training can improve task success on web benchmarks non-trivially. Building on this, we…
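The effect of a longer interaction horizon can be illustrated with a toy example. The sketch below is not the paper's agent or environment: the `SITE` graph, the `rollout` function, and the depth-first policy are all hypothetical. It only shows that when every click or back-navigation costs one step, a rollout that must explore and backtrack succeeds only if the step budget is large enough.

```python
from typing import Dict, List

# Hypothetical toy "website": each page lists the pages it links to.
SITE: Dict[str, List[str]] = {
    "home": ["about", "products"],
    "about": [],
    "products": ["checkout"],
    "checkout": [],
}


def rollout(goal: str, max_steps: int) -> bool:
    """Explore depth-first from 'home' within a single rollout.

    Each forward click and each backtrack costs one step. Returns True
    if the goal page is reached within max_steps, False otherwise.
    """
    path = ["home"]            # current navigation stack
    visited = {"home"}
    steps = 0
    while path and steps < max_steps:
        page = path[-1]
        if page == goal:
            return True
        unvisited = [p for p in SITE[page] if p not in visited]
        if unvisited:
            nxt = unvisited[0]  # explore the first unseen link
            visited.add(nxt)
            path.append(nxt)
        else:
            path.pop()          # dead end: backtrack one page
        steps += 1
    # Budget exhausted or site fully explored: check where we ended up.
    return bool(path) and path[-1] == goal


print(rollout("checkout", max_steps=3))  # budget spent on the 'about' dead end
print(rollout("checkout", max_steps=4))  # one more step is enough to recover
```

With a budget of 3 the agent uses two steps entering and leaving the dead-end page and runs out before reaching the goal; with a budget of 4 the same policy succeeds.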

Code & Models

Repositories

test-time-interaction/tti (PyTorch, official)

Taxonomy

Topics: Reinforcement Learning in Robotics · Mobile Crowdsensing and Crowdsourcing · Multimodal Machine Learning Applications