Acknowledging Focus Ambiguity in Visual Questions
Chongyan Chen, Yu-Yun Tseng, Zhuoheng Li, Anush Venkatesh, Danna Gurari

TL;DR
This paper introduces VQ-FocusAmbiguity, a VQA dataset that visually grounds every plausible image region a question might refer to. Benchmarks on recognizing focus ambiguity and locating all plausible focus regions show that both tasks are challenging for current models.
Contribution
It presents the first VQA dataset that visually grounds each plausible focus region of a question, and it benchmarks modern models on two new tasks: recognizing whether a visual question has focus ambiguity and locating all plausible focus regions.
Findings
The dataset reveals significant challenges for modern VQA models.
Models struggle to identify all plausible focus regions.
The dataset provides a new benchmark for focus ambiguity tasks.
Abstract
No published work on visual question answering (VQA) accounts for ambiguity regarding where the content described in the question is located in the image. To fill this gap, we introduce VQ-FocusAmbiguity, the first VQA dataset that visually grounds each plausible image region a question could refer to when arriving at valid answers. We next analyze and compare our dataset to existing datasets to reveal its unique properties. Finally, we benchmark modern models for two novel tasks related to acknowledging focus ambiguity: recognizing whether a visual question has focus ambiguity and locating all plausible focus regions within the image. Results show that the dataset is challenging for modern models. To facilitate future progress on these tasks, we publicly share the dataset with an evaluation server at https://vizwiz.org/tasks-and-datasets/focus-ambiguity-in-visual-questions.
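The two tasks named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the dataset's actual format or the paper's metrics: regions are simplified to axis-aligned boxes, a question is treated as ambiguous when it has more than one plausible region, and localization is scored by a hypothetical IoU-thresholded recall. The names VisualQuestion, has_focus_ambiguity, and region_recall are invented for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a region is an axis-aligned box (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


@dataclass
class VisualQuestion:
    """Hypothetical record: a question plus every plausible focus region."""
    question: str
    focus_regions: List[Box]


def has_focus_ambiguity(vq: VisualQuestion) -> bool:
    # Assumption: ambiguity means more than one plausible region.
    return len(vq.focus_regions) > 1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def region_recall(predicted: List[Box], ground_truth: List[Box],
                  threshold: float = 0.5) -> float:
    """Fraction of ground-truth regions matched by some prediction.

    Illustrative metric only; not taken from the paper.
    """
    if not ground_truth:
        return 1.0
    hits = sum(
        any(iou(p, g) >= threshold for p in predicted) for g in ground_truth
    )
    return hits / len(ground_truth)


vq = VisualQuestion(
    question="What color is the cup?",
    focus_regions=[(0, 0, 10, 10), (20, 20, 30, 30)],  # two cups in the image
)
print(has_focus_ambiguity(vq))
print(region_recall([(0, 0, 10, 10)], vq.focus_regions))
```

In this sketch, a model that finds only one of the two cups is penalized on localization even though its single prediction is correct, which mirrors the abstract's emphasis on locating all plausible focus regions.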
Taxonomy
Topics: Advanced Text Analysis Techniques
Methods: Focus
