Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts

Aiden Yiliu Li; Xinyue Hao; Shilong Liu; Mengdi Wang

arXiv:2602.02468 · cs.AI · February 3, 2026

Open Access

TL;DR

Avenir-Web is a multimodal web agent that combines a Mixture of Grounding Experts with Experience-Imitation Planning to improve long-horizon task execution on complex, dynamic web interfaces, achieving a new open-source state of the art on the Online-Mind2Web benchmark.

Contribution

The paper introduces a web agent architecture that combines a Mixture of Grounding Experts, procedural priors supplied through Experience-Imitation Planning, and a task-tracking checklist with adaptive memory, advancing open-source capabilities for complex web task automation.

Findings

01 Surpasses prior open-source agents on the Online-Mind2Web benchmark

02 Achieves performance parity with top proprietary models

03 Demonstrates robustness across diverse web interfaces

Abstract

Despite advances in multimodal large language models, autonomous web agents still struggle to reliably execute long-horizon tasks on complex and dynamic web interfaces. Existing agents often suffer from inaccurate element grounding, the absence of site-specific procedural knowledge, and unstable long-term task tracking and memory, particularly when operating over complex Document Object Model structures. To address these limitations, we introduce Avenir-Web, a web agent that achieves a new open-source state of the art on the Online-Mind2Web benchmark in real-world deployment. Avenir-Web leverages a Mixture of Grounding Experts, Experience-Imitation Planning for incorporating procedural priors, and a task-tracking checklist combined with adaptive memory to enable robust and seamless interaction across diverse user interface paradigms. We evaluate Avenir-Web on Online-Mind2Web, a rigorous…

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling