Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

TL;DR
This paper introduces a framework for reliable visual question answering (VQA) that emphasizes abstaining from answering when uncertain, proposes new metrics and methods to improve coverage while maintaining low error rates, and promotes self-aware models in multimodal AI.
Contribution
It formulates a new reliable VQA problem emphasizing abstention, proposes a multimodal selection function, and introduces an Effective Reliability metric to better evaluate model performance.
Findings
Models with abstention can answer fewer than 7.5% of questions at 1% error risk.
Using a multimodal selection function increases coverage from 6.8% to 15.6%.
The proposed metric emphasizes the importance of abstaining to improve reliability.
Abstract
Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans can say "I don't know" when they are uncertain (i.e., abstain from answering a question), such ability has been largely neglected in multimodal research, despite the importance of this problem to the usage of VQA in real settings. In this work, we promote a problem formulation for reliable VQA, where we prefer abstention over providing an incorrect answer. We first enable abstention capabilities for several VQA models, and analyze both their coverage, the portion of questions answered, and risk, the error on that portion. For that, we explore several abstention approaches. We find that although the best performing models achieve over 70% accuracy on the VQA v2 dataset, introducing the option to abstain by directly using a…
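The coverage and risk quantities described in the abstract can be illustrated with a minimal sketch. It assumes a simplified setting that is not taken from the paper: each question has a single confidence score (for example, the maximum softmax probability) and a binary correct/incorrect label, whereas VQA accuracy is normally a soft score. The function names `coverage_and_risk` and `max_coverage_at_risk` and the toy data are hypothetical.

```python
def coverage_and_risk(confidences, correct, threshold):
    """Answer only when confidence >= threshold; abstain otherwise.

    Returns (coverage, risk): the fraction of questions answered and
    the error rate on the answered questions.
    """
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    # Risk is defined as 0.0 here when nothing is answered (an assumption).
    risk = 1 - sum(answered) / len(answered) if answered else 0.0
    return coverage, risk


def max_coverage_at_risk(confidences, correct, target_risk):
    """Largest coverage whose risk does not exceed target_risk."""
    ranked = sorted(zip(confidences, correct), key=lambda pair: -pair[0])
    total = len(ranked)
    best, errors = 0.0, 0
    for k, (conf, ok) in enumerate(ranked, start=1):
        errors += 0 if ok else 1
        # Only evaluate at boundaries between distinct confidence values,
        # since a threshold cannot separate questions with equal confidence.
        at_boundary = k == total or ranked[k][0] != conf
        if at_boundary and errors / k <= target_risk:
            best = k / total
    return best


# Toy data (hypothetical): four questions, one answered incorrectly.
confidences = [0.9, 0.8, 0.6, 0.4]
correct = [1, 1, 0, 1]

cov, risk = coverage_and_risk(confidences, correct, threshold=0.7)
best_cov = max_coverage_at_risk(confidences, correct, target_risk=0.0)
```

Raising the threshold trades coverage for lower risk. Reporting the largest coverage that stays under a fixed risk, such as 1%, is one way to compare abstention approaches on equal terms.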
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Softmax
