Evaluating Long-Context Reasoning in LLM-Based WebAgents
Andy Chung, Yichi Zhang, Kaixiang Lin, Aditya Rawal, Qiaozi Gao, Joyce Chai

TL;DR
This paper evaluates the ability of large language model-based WebAgents to reason over extended interaction histories, revealing significant performance drops in long contexts and highlighting key challenges for real-world deployment.
Contribution
It introduces a benchmark and evaluation framework for long-context reasoning in WebAgents and provides empirical insights into model limitations and potential improvements.
Findings
Performance drops from 40-50% to less than 10% in long contexts
Agents often get stuck in loops and lose track of tasks
Implicit RAG offers modest improvements in summarization
Abstract
As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long-context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating the long-context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieving and applying information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of…
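The context-construction idea described in the abstract (irrelevant trajectories injected between two dependent subtasks until a target length is reached) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_history` and `approx_tokens`, the whitespace-based token count, and the cycling order of distractors are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def approx_tokens(text: str) -> int:
    """Crude token count: whitespace-separated words (stand-in for a real tokenizer)."""
    return len(text.split())


def build_history(
    first_subtask: str,
    second_subtask: str,
    distractors: Sequence[str],
    target_tokens: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Place irrelevant trajectories between two dependent subtasks.

    Distractors are taken in order (cycling if needed) until adding the next
    one would push the total past target_tokens. Hypothetical helper, not
    taken from the paper.
    """
    if not distractors:
        return [first_subtask, second_subtask]

    used = count_tokens(first_subtask) + count_tokens(second_subtask)
    filler: List[str] = []
    i = 0
    while True:
        candidate = distractors[i % len(distractors)]
        cost = count_tokens(candidate)
        # Stop on an empty distractor (avoids an endless loop) or when over budget.
        if cost == 0 or used + cost > target_tokens:
            break
        filler.append(candidate)
        used += cost
        i += 1

    # The second subtask comes last, so solving it requires recalling
    # information from the first subtask across the injected filler.
    return [first_subtask, *filler, second_subtask]
```

Calling `build_history` with a budget in the range the abstract mentions (for example `target_tokens=25_000` or `150_000`) and a real tokenizer passed as `count_tokens` would yield histories of comparable length, with the dependent subtasks always at the two ends.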
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
