WebQuest: A Benchmark for Multimodal QA on Web Page Sequences

Maria Wang; Srinivas Sunkara; Gilles Baechler; Jason Lin; Yun Zhu; Fedir Zubach; Lei Shu; Jindong Chen

arXiv:2409.13711·cs.IR·September 26, 2024

TL;DR

WebQuest introduces a challenging multimodal web page sequence dataset for multi-page question answering, highlighting the need for advanced reasoning across multiple web pages and evaluating current models' capabilities.

Contribution

The paper presents WebQuest, a novel benchmark for multimodal multi-page QA that emphasizes real-world web reasoning and evaluates both proprietary and open-source models.

Findings

1. Significant performance gap between single-screen and multi-screen reasoning.

2. Evaluation of leading models reveals limitations in multi-page reasoning.

3. Chain-of-Thought prompting improves multi-screen inference capabilities.

Abstract

The rise of powerful multimodal LLMs has enhanced the viability of building web agents that can, with increasing levels of autonomy, help users retrieve information and complete tasks on various human-computer interfaces. It is hence necessary to build challenging benchmarks that span a wide variety of use cases reflecting real-world usage. In this work, we present WebQuest, a multi-page question-answering dataset that requires reasoning across multiple related web pages. In contrast to existing UI benchmarks that focus on multi-step web navigation and task completion, our dataset evaluates information extraction, multimodal retrieval, and composition of information from many web pages. WebQuest includes three question categories: single-screen QA, multi-screen QA, and QA based on navigation traces. We evaluate leading proprietary multimodal models like GPT-4V, Gemini Flash, Claude…
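To make the three question categories concrete, here is a minimal Python sketch of how an example from such a dataset might be represented and tallied per category. This is an illustration only: the class names, field names, and sample records are hypothetical and are not taken from the paper or from any released WebQuest data format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Category(Enum):
    # The three question categories named in the abstract.
    SINGLE_SCREEN = "single-screen QA"
    MULTI_SCREEN = "multi-screen QA"
    NAVIGATION_TRACE = "QA based on navigation traces"


@dataclass
class Example:
    # All field names are assumptions made for illustration.
    question: str
    answer: str
    screenshots: List[str]  # ordered screenshot paths, one per web page
    category: Category


def count_by_category(examples: List[Example]) -> Dict[Category, int]:
    """Count how many examples fall into each question category."""
    counts = {category: 0 for category in Category}
    for example in examples:
        counts[example.category] += 1
    return counts


# Invented placeholder records, not real dataset entries.
demo = [
    Example("Q1", "A1", ["page_1.png"], Category.SINGLE_SCREEN),
    Example("Q2", "A2", ["page_1.png", "page_2.png"], Category.MULTI_SCREEN),
    Example("Q3", "A3", ["step_1.png", "step_2.png", "step_3.png"],
            Category.NAVIGATION_TRACE),
    Example("Q4", "A4", ["page_3.png", "page_4.png"], Category.MULTI_SCREEN),
]

for category, count in count_by_category(demo).items():
    print(f"{category.value}: {count}")
```

Keeping the category as an explicit field makes it straightforward to report accuracy separately for single-screen, multi-screen, and navigation-trace questions, which is the kind of comparison the findings above rely on.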

Taxonomy

Topics: Education and Digital Technologies

Methods: Focus