Evaluating Long-Context Reasoning in LLM-Based WebAgents

Andy Chung; Yichi Zhang; Kaixiang Lin; Aditya Rawal; Qiaozi Gao; Joyce Chai

arXiv:2512.04307 · cs.LG · December 5, 2025

Open Access

TL;DR

This paper evaluates how well large language model (LLM)-based WebAgents reason over extended interaction histories. It finds that performance falls from 40-50% to below 10% in long contexts and identifies key challenges for real-world deployment.

Contribution

The paper introduces a benchmark and evaluation framework for long-context reasoning in WebAgents and provides empirical insights into model limitations and potential improvements.

Findings

01. Performance drops from 40-50% to less than 10% in long contexts
02. Agents often get stuck in loops and lose track of tasks
03. Implicit RAG offers modest improvements in summarization

Abstract

As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating long context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of…
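The context-construction idea described in the abstract (irrelevant task trajectories injected between two dependent subtasks until the history reaches a target length) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Trajectory`, `count_tokens`, and `build_history` are hypothetical, and whitespace splitting is assumed as a stand-in for a real tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """One task session: a label plus its recorded interaction text (hypothetical type)."""
    name: str
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def build_history(
    first: Trajectory,
    followup: Trajectory,
    distractors: List[Trajectory],
    target_tokens: int,
) -> List[Trajectory]:
    """Place irrelevant trajectories between two dependent subtasks.

    Distractors are appended after the first subtask until the running
    token count reaches target_tokens; the dependent follow-up comes last.
    """
    history = [first]
    total = count_tokens(first.text)
    for distractor in distractors:
        if total >= target_tokens:
            break
        history.append(distractor)
        total += count_tokens(distractor.text)
    history.append(followup)
    return history
```

Under these assumptions, raising `target_tokens` (for example from 25,000 toward 150,000) lengthens the gap between the subtask that supplies a piece of information and the subtask that needs it, which is the variable the benchmark is described as controlling.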

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques