WebQA: Multihop and Multimodal QA
Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, Yonatan Bisk

TL;DR
WebQA introduces a new benchmark for multihop, multimodal web-based question answering, highlighting the challenges for current models and emphasizing the need for unified reasoning across visual and textual sources.
Contribution
The paper presents WebQA, a challenging benchmark that combines visual and textual reasoning tasks to advance multimodal question answering models.
Findings
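Large-scale state-of-the-art models struggle with WebQA's multimodal, multihop reasoning, although the task is trivial for humans.
WebQA includes a secondary text-only QA task to check that gains in visual performance do not come at the cost of language understanding.
The gap between model and human performance indicates substantial room for improvement.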
Abstract
Scaling Visual Question Answering (VQA) to the open-domain and multi-hop nature of web searches requires fundamental advances in visual representation learning, knowledge aggregation, and language generation. In this work, we introduce WebQA, a challenging new benchmark that proves difficult for large-scale state-of-the-art models, which lack language-groundable visual representations for novel objects and the ability to reason, yet trivial for humans. WebQA mirrors the way humans use the web: 1) Ask a question, 2) Choose sources to aggregate, and 3) Produce a fluent language response. This is the behavior we should be expecting from IoT devices and digital assistants. Existing work prefers to assume that a model can either reason about knowledge in images or in text. WebQA includes a secondary text-only QA task to ensure improved visual performance does not come at the cost of language…
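The three-step flow described in the abstract (ask a question, choose sources to aggregate, produce a response) can be illustrated with a minimal toy sketch. This is not the paper's model or data format: the `Source` class, the `select_sources` and `answer` functions, the word-overlap scoring, and the example sources are all hypothetical, and images are stood in for by captions purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """A hypothetical candidate source: a text snippet or an image caption."""
    source_id: str
    modality: str  # "text" or "image"
    content: str   # snippet text, or a caption standing in for an image


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip(".,?!").lower() for w in text.split() if w.strip(".,?!")}


def select_sources(question, candidates, top_k=2):
    """Step 2: keep the top_k candidates that share words with the question.

    Word overlap is a stand-in for a learned retriever (an assumption).
    """
    q = tokenize(question)
    ranked = sorted(
        candidates,
        key=lambda s: len(q & tokenize(s.content)),
        reverse=True,
    )
    return [s for s in ranked[:top_k] if q & tokenize(s.content)]


def answer(question, sources):
    """Step 3: aggregate the selected sources into one response.

    A real system would generate fluent text; this toy version only
    concatenates the selected content.
    """
    if not sources:
        return "No supporting source found."
    return " ".join(s.content for s in sources)


# Step 1: ask a question over a small, made-up candidate pool.
candidates = [
    Source("t1", "text", "The Eiffel Tower is in Paris."),
    Source("i1", "image", "A photo of the Eiffel Tower at night."),
    Source("t2", "text", "Bananas are rich in potassium."),
]
question = "Where is the Eiffel Tower?"
selected = select_sources(question, candidates)
print([s.source_id for s in selected])
print(answer(question, selected))
```

The sketch mixes a text source and an image source in the same candidate pool, which mirrors the benchmark's point that a single model should aggregate across both modalities instead of assuming the knowledge lives in only one.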
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
