AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik Chaudhari, George Karypis, Huzefa Rangwala

TL;DR
AgentOccam demonstrates that simple alignment of observation and action spaces with LLM capabilities significantly improves web agent performance, surpassing previous methods without complex strategies.
Contribution
The paper introduces a straightforward approach to enhance LLM-based web agents by refining their observation and action spaces, achieving state-of-the-art results without additional in-context learning or search strategies.
Findings
AgentOccam outperforms previous methods on the WebArena benchmark.
Aligning the observation and action spaces raises the success rate by over 160% in relative terms, compared with similar plain web agents.
Simple design leverages LLMs' zero-shot capabilities effectively.
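The "over 160%" figure above is a relative gain, which is easy to confuse with an absolute gain in percentage points. The sketch below shows the difference. The function name and the two success rates are hypothetical, chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


# Hypothetical success rates in percent (illustration only, not from the paper).
baseline_rate = 10.0
improved_rate = 26.0

# Absolute gain is measured in percentage points.
absolute_gain = improved_rate - baseline_rate
# Relative gain is measured as a percentage of the baseline.
relative_gain = relative_improvement(baseline_rate, improved_rate)

print(f"absolute gain: {absolute_gain:.1f} points")
print(f"relative gain: {relative_gain:.1f}%")
```

With these illustrative inputs, a 16-point absolute gain on a low baseline corresponds to a relative gain of about 160%, which is why the baseline matters when reading such a figure.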
Abstract
Autonomy via agents using large language models (LLMs) for personalized, standardized tasks boosts human efficiency. Automating web tasks (like booking hotels within a budget) is increasingly sought after. Fulfilling practical needs, the web agent also serves as an important proof-of-concept example for various agent grounding scenarios, with its success promising advancements in many future applications. Prior research often handcrafts web agent strategies (e.g., prompting templates, multi-agent systems, search methods, etc.) and the corresponding in-context examples, which may not generalize well across all real-world scenarios. On the other hand, there has been limited study on the misalignment between a web agent's observation/action representation and the pre-training data of the LLM it's based on. This discrepancy is especially notable when LLMs are primarily trained for language…
Peer Reviews
Decision: ICLR 2025 Poster
The following are the major strengths of the paper:
- The authors focus on the first principles of building a domain-specific agent, i.e., a web agent, and hence use domain knowledge to optimize how LLMs can best be used in this domain.
- The ablations are very well presented, allowing the reader to understand the impact of each of the changes proposed in the paper.
- The results are shown on WebArena, which is a simulated environment. It would be more convincing to see how the method performs on WebVoyager, which better reflects how the web behaves.
- The tree representations used for planning need better explanation. I have to be honest that I couldn't grasp them as well as I would want to. While I understand them in principle, presenting them with a more detailed example, even if in the Appendix, might help someone like me look into the details.
- While the met
- The overall experimental results are positive, and new SOTA performance was obtained on the WebArena benchmark.
- The proposed design is simple yet effective, so it should be rather easy to apply.
- The experiments on WebArena are extensive, with a comparison of multiple base models and ablations that show the improvement from individual components.
- The main contribution, and also the main source of improvement, is the simplification of the action and observation spaces. A natural question is how well these simplifications generalize across websites and tasks. Although WebArena contains multiple tasks, it still only contains five websites. To further verify the effectiveness of the proposed method, I would suggest including experiments on other datasets, e.g., Mind2Web, which has more websites.
- Similarly, the alignment could
Pros:
- The perspective taken to improve the capability of web agents is great, as it dives into the action space and observation space. It is a common pathway that may be effective not only in web tasks but also in other tasks with complex actions and observations.
- I think the action optimization is inspiring, especially the abstraction and generation of new actions. From my perspective, it is similar to the tool learning problem and can be effective.
- The improvement in performance seems great, where
Cons:
- The authors claim in the abstract to solve generalizability across all real-world applications. However, the method section seems to only illustrate how to deal with the specific web environment. There are too many instances of 'specifically' and 'in particular' in Sections 4.1 and 4.2. Does this mean I need to design different merges manually, with the method just providing a guide or strategy for conducting the manual process? I think an ideal optimization is to conduct the optimization automatic
Taxonomy
Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Service-Oriented Architecture and Web Services
Methods: ALIGN · Balanced Selection
