Selectively Answering Visual Questions
Julian Martin Eisenschlos, Hernán Maina, Guido Ivetta, Luciana Benotti

TL;DR
This paper analyzes calibration methods for large multi-modal models in visual question answering, proposing a new metric and demonstrating the importance of calibration for selective answering in vision tasks.
Contribution
It provides the first in-depth analysis of calibration techniques for VQA with in-context learning LMMs and introduces the Avg BLEU score for better calibration assessment.
Findings
Likelihood scores of visually grounded models are better calibrated than those of their text-only counterparts.
Sampling methods generally outperform likelihood-based methods.
Avg BLEU combines sampling and likelihood benefits across modalities.
Abstract
Recently, large multi-modal models (LMMs) have emerged with the capacity to perform vision tasks such as captioning and visual question answering (VQA) with unprecedented accuracy. Applications such as helping the blind or visually impaired have a critical need for precise answers. It is especially important for models to be well calibrated and be able to quantify their uncertainty in order to selectively decide when to answer and when to abstain or ask for clarifications. We perform the first in-depth analysis of calibration methods and metrics for VQA with in-context learning LMMs. Studying VQA on two answerability benchmarks, we show that the likelihood score of visually grounded models is better calibrated than in their text-only counterparts for in-context learning, where sampling-based methods are generally superior, but no clear winner arises. We propose Avg BLEU, a calibration…
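To make the idea of selective answering concrete, here is a minimal Python sketch of thresholding a confidence score to decide when to answer and when to abstain. It is an illustration only, not the paper's method: the function name `selective_answer`, the threshold value, and the toy scores are all assumptions, and the confidence could come from any source (a likelihood score, agreement among sampled answers, or another calibration score).

```python
from typing import List, Tuple


def selective_answer(
    confidences: List[float],
    correct: List[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Answer only when confidence >= threshold; abstain otherwise.

    Returns (coverage, accuracy_on_answered):
      coverage             = fraction of questions the model chose to answer
      accuracy_on_answered = fraction of answered questions that were correct
    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    # Keep the correctness flags of the questions that clear the threshold.
    answered = [is_ok for score, is_ok in zip(confidences, correct) if score >= threshold]

    coverage = len(answered) / len(confidences) if confidences else 0.0
    accuracy = sum(answered) / len(answered) if answered else 0.0
    return coverage, accuracy


# Toy data (made up): four questions with a confidence and a correctness flag.
scores = [0.9, 0.8, 0.3, 0.6]
labels = [True, True, False, False]

low = selective_answer(scores, labels, threshold=0.5)
high = selective_answer(scores, labels, threshold=0.7)
```

With a well-calibrated score, raising the threshold trades coverage for accuracy on the answered questions; in the toy data, the stricter threshold answers fewer questions but gets all of them right.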
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
