TIDE: Trajectory-based Diagnostic Evaluation of Test-Time Improvement in LLM Agents
Hang Yan, Xinyu Che, Fangzhi Xu, Qiushi Sun, Zichen Ding, Kanzhi Cheng, Jian Zhang, Tao Qin, Jun Liu, Qika Lin

TL;DR
This paper introduces TIDE, a framework for diagnosing and understanding the effectiveness of test-time improvement in autonomous LLM agents, focusing on interaction dynamics, memory, and behavior adaptation.
Contribution
The paper presents TIDE, an agent-agnostic and environment-agnostic evaluation framework that decomposes test-time improvement into three interconnected dimensions to explain what drives or limits agent performance.
Findings
Performance improvements depend on interaction dynamics, not just reasoning scale.
Memory burdens can constrain task completion.
Behavioral analysis reveals key factors influencing success.
Abstract
Recent advances in autonomous LLM agents demonstrate their ability to improve performance through iterative interaction with the environment. We define this paradigm as Test-Time Improvement (TTI). However, the mechanisms underlying how and why TTI succeeds or fails remain poorly understood, and existing evaluation metrics fail to capture agents' task optimization efficiency, behavior adaptation after erroneous actions, and the specific utility of working memory for task completion. To address these gaps, we propose Test-time Improvement Diagnostic Evaluation (TIDE), an agent-agnostic and environment-agnostic framework that decomposes TTI into three comprehensive and interconnected dimensions. The framework measures (1) the overall temporal dynamics of task completion and (2) identifies whether performance is primarily constrained by recursive looping behaviors or (3) by burdensome accumulated…
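To make the idea of "recursive looping behaviors" concrete, here is a minimal sketch of one way such behavior could be flagged in a logged trajectory. It is not the metric defined in the paper: the function name `repeated_step_ratio`, the representation of a trajectory as (observation, action) pairs, and the choice to count exact repeats are all assumptions made for illustration.

```python
from typing import Hashable, Sequence, Tuple

# Hypothetical trajectory format: one (observation, action) pair per step.
Step = Tuple[Hashable, Hashable]


def repeated_step_ratio(trajectory: Sequence[Step]) -> float:
    """Return the fraction of steps whose (observation, action) pair
    already occurred earlier in the same trajectory.

    Illustrative proxy only; not the paper's definition of looping.
    """
    if not trajectory:
        return 0.0
    seen = set()
    repeats = 0
    for step in trajectory:
        if step in seen:
            repeats += 1  # the agent revisited an identical situation and action
        else:
            seen.add(step)
    return repeats / len(trajectory)


example = [
    ("room", "open door"),
    ("room", "open door"),   # repeat
    ("hall", "go north"),
    ("room", "open door"),   # repeat
]
ratio = repeated_step_ratio(example)
```

A high ratio would suggest the agent keeps retrying the same action in the same situation rather than adapting after an error. A real diagnostic would likely need a softer notion of "same state" than exact equality.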
Taxonomy
Topics: Reinforcement Learning in Robotics · AI-based Problem Solving and Planning · Robot Manipulation and Learning
