Selectively Answering Visual Questions

Julian Martin Eisenschlos, Hernán Maina, Guido Ivetta, Luciana Benotti

arXiv:2406.00980·cs.CL·June 4, 2024



TL;DR

This paper analyzes calibration methods for large multi-modal models in visual question answering, proposing a new metric and demonstrating the importance of calibration for selective answering in vision tasks.

Contribution

It provides the first in-depth analysis of calibration techniques for VQA with in-context learning LMMs and introduces the Avg BLEU score for better calibration assessment.

Findings

01

Likelihood scores of visually grounded models are better calibrated than those of their text-only counterparts.

02

Sampling methods generally outperform likelihood-based methods.

03

Avg BLEU combines sampling and likelihood benefits across modalities.
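
To make the idea of a sampling-based confidence score concrete, the following is a minimal sketch of one generic approach: sample several answers and use the share that agree with the most common one as the confidence. This is an illustration only and is not the paper's Avg BLEU metric, whose definition is not given on this page. The function name `agreement_confidence` and the lowercase/strip normalization are assumptions made for the example.

```python
from collections import Counter


def agreement_confidence(samples):
    """Return (majority_answer, fraction of samples that agree with it).

    Hypothetical helper: exact-match agreement after simple normalization,
    used here as a stand-in for a sampling-based confidence score.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    # Normalize so trivial casing/whitespace differences count as agreement.
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)


# Usage: four sampled answers to the same visual question.
answer, confidence = agreement_confidence(["Red", "red ", "blue", "red"])
```

Exact-match agreement is coarse for free-form answers; a softer text-similarity measure between samples could replace it, at the cost of extra computation.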

Abstract

Recently, large multi-modal models (LMMs) have emerged with the capacity to perform vision tasks such as captioning and visual question answering (VQA) with unprecedented accuracy. Applications such as helping the blind or visually impaired have a critical need for precise answers. It is especially important for models to be well calibrated and able to quantify their uncertainty, so that they can selectively decide when to answer and when to abstain or ask for clarification. We perform the first in-depth analysis of calibration methods and metrics for VQA with in-context learning LMMs. Studying VQA on two answerability benchmarks, we show that the likelihood score of visually grounded models is better calibrated than that of their text-only counterparts for in-context learning, where sampling-based methods are generally superior, but no clear winner arises. We propose Avg BLEU, a calibration…
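
The selective-answering setup described in the abstract can be sketched as a confidence threshold: the model answers only when its confidence is high enough and abstains otherwise. The code below is a minimal illustration under stated assumptions, not the paper's evaluation protocol. The names `selective_answer` and `coverage_and_accuracy`, the default threshold of 0.5, and the toy data are all hypothetical.

```python
def selective_answer(answer, confidence, threshold=0.5):
    """Return the answer if confidence clears the threshold, else None (abstain)."""
    return answer if confidence >= threshold else None


def coverage_and_accuracy(confidences, correct, threshold):
    """Compute coverage (share of questions answered) and accuracy on those answered.

    confidences: list of floats, one per question (assumed to lie in [0, 1]).
    correct: list of bools, whether the model's answer was right.
    Returns (coverage, accuracy); accuracy is None if nothing was answered.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(answered) / len(confidences)
    accuracy = sum(answered) / len(answered) if answered else None
    return coverage, accuracy


# Usage with toy values: three of four questions clear the 0.5 threshold.
coverage, accuracy = coverage_and_accuracy(
    [0.9, 0.8, 0.3, 0.6], [True, True, False, False], threshold=0.5
)
```

Raising the threshold typically lowers coverage; accuracy on the answered subset rises with it only if the confidence scores are informative, which is why calibration matters for this setting.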



Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization