See and Remember: A Multimodal Agent for Web Traversal
Xinjun Wang, Shengyao Wang, Aimin Zhou, Hao Hao

TL;DR
This paper introduces V-GEMS, a multimodal agent architecture that enhances web navigation by integrating visual grounding and explicit memory, leading to improved accuracy and robustness in complex web traversal tasks.
Contribution
The paper presents V-GEMS, a novel multimodal agent with explicit memory and visual grounding, addressing spatial disorientation and navigation loops in web traversal.
Findings
V-GEMS outperforms the WebWalker baseline by 28.7%.
The explicit memory system improves backtracking and cycle prevention.
Visual grounding resolves ambiguous interactive elements effectively.
Abstract
Autonomous web navigation requires agents to perceive complex visual environments and maintain long-term context, yet current Large Language Model (LLM) based agents often struggle with spatial disorientation and navigation loops. In this paper, we propose V-GEMS (Visual Grounding and Explicit Memory System), a generally applicable, robust multimodal agent architecture designed for precise and resilient web traversal. Our agent integrates visual grounding to resolve ambiguous interactive elements and introduces an explicit memory stack with state tracking. This dual mechanism allows the agent to maintain a structured map of its traversal path, enabling valid backtracking and preventing cyclical failures in deep navigation tasks. We also introduce an updatable dynamic benchmark to rigorously evaluate adaptability. Experiments show V-GEMS significantly outperforms the WebWalker baseline,…
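To make the memory idea concrete, the following is a minimal sketch of how a path stack combined with a visited-state set can support backtracking and cycle avoidance. It is not the authors' implementation: the names `TraversalMemory`, `traverse`, and the toy `links` graph are hypothetical, pages are reduced to string identifiers, and visual grounding is left out entirely.

```python
from typing import Callable, Dict, List, Optional, Set


class TraversalMemory:
    """Hypothetical explicit memory: a path stack plus a visited-state set."""

    def __init__(self, start: str) -> None:
        self.path: List[str] = [start]    # current traversal path, root first
        self.visited: Set[str] = {start}  # every page state seen so far

    def can_visit(self, page: str) -> bool:
        # A page already seen would create a navigation loop.
        return page not in self.visited

    def visit(self, page: str) -> bool:
        # Push the page onto the path only if it is new.
        if not self.can_visit(page):
            return False
        self.path.append(page)
        self.visited.add(page)
        return True

    def backtrack(self) -> Optional[str]:
        # Pop the current page and return its parent, or None at the root.
        if len(self.path) <= 1:
            return None
        self.path.pop()
        return self.path[-1]


def traverse(
    links: Dict[str, List[str]],
    start: str,
    is_goal: Callable[[str], bool],
) -> Optional[List[str]]:
    """Depth-first walk over a toy link graph using the memory above."""
    memory = TraversalMemory(start)
    current: Optional[str] = start
    while current is not None:
        if is_goal(current):
            return list(memory.path)
        # Pick the first outgoing link that has not been seen yet.
        next_page = next(
            (p for p in links.get(current, []) if memory.can_visit(p)), None
        )
        if next_page is not None:
            memory.visit(next_page)
            current = next_page
        else:
            # Dead end: return to the parent page instead of looping.
            current = memory.backtrack()
    return None


# Toy site (an assumption for illustration): "about" links back to "home".
links = {
    "home": ["about", "docs"],
    "about": ["home"],
    "docs": ["home", "api"],
    "api": [],
}
print(traverse(links, "home", lambda page: page == "api"))
# → ['home', 'docs', 'api']
```

In this toy run the walk enters "about", finds only the already visited "home", backtracks, and then reaches "api" through "docs". The visited set is what blocks the home/about cycle, and the stack is what makes the return to "home" a valid step rather than a fresh navigation.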
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
