WebVLN: Vision-and-Language Navigation on Websites
Qi Chen, Dileepa Pitawela, Chongyang Zhao, Gengze Zhou, Hsiang-Ting Chen, Qi Wu

TL;DR
This paper introduces WebVLN, a novel vision-and-language navigation task on websites, along with a dataset and a specialized network, enabling AI agents to navigate web content using question-based instructions and web-specific information.
Contribution
The paper presents the new WebVLN task, the WebVLN-v1 dataset, and a novel WebVLN-Net model that incorporates web-specific content for improved navigation performance.
Findings
WebVLN-Net outperforms existing VLN and web navigation methods.
The WebVLN dataset enables research on web-based navigation tasks.
Incorporating HTML content improves navigation accuracy.
Abstract
The Vision-and-Language Navigation (VLN) task aims to enable AI agents to accurately understand and follow natural language instructions to navigate through real-world environments, ultimately reaching specific target locations. We recognise a promising opportunity to extend VLN to a comparable navigation task that holds substantial significance in our daily lives, albeit within the virtual realm: navigating websites on the Internet. This paper proposes a new task named Vision-and-Language Navigation on Websites (WebVLN), where we use question-based instructions to train an agent, emulating how users naturally browse websites. Unlike the existing VLN task that only pays attention to vision and instruction (language), the WebVLN agent further considers underlying web-specific content like HTML, which could not be seen on the rendered web pages yet contains rich visual and textual…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
