Mind the Uncertainty in Human Disagreement: Evaluating Discrepancies between Model Predictions and Human Responses in VQA
Jian Lan, Diego Frassinelli, Barbara Plank

TL;DR
This paper evaluates how well vision-language models, especially BEiT3, align with the diverse and uncertain responses of humans in VQA tasks, revealing current limitations and proposing calibration improvements.
Contribution
It introduces three new human-correlated metrics to assess how closely model predictions match human response distributions, and analyzes how calibration techniques affect a model's ability to capture human uncertainty in VQA.
Findings
BEiT3 struggles to model multi-label human response distributions.
Calibration techniques aimed at accuracy can worsen alignment with human responses.
Calibrating models towards human distributions improves alignment with human uncertainty.
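The contrast between the last two findings can be illustrated with temperature scaling, a common post-hoc calibration technique. The sketch below is an assumption-laden illustration, not the paper's procedure: the function names, the grid of temperatures, and the use of KL divergence as the fitting objective are all hypothetical choices. It fits a single temperature either toward a one-hot majority answer or toward the full distribution of human answers.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def fit_temperature(logit_rows, target_rows, grid=None):
    """Grid-search the temperature that minimizes mean KL(target || model).

    The grid (0.25 to 10.0) is an illustrative assumption.
    """
    grid = grid or [0.25 * k for k in range(1, 41)]

    def mean_kl(t):
        pairs = zip(logit_rows, target_rows)
        return sum(kl_divergence(h, softmax(z, t)) for z, h in pairs) / len(logit_rows)

    return min(grid, key=mean_kl)

# One hypothetical question with two candidate answers.
logits = [[4.0, 0.0]]          # the model strongly prefers answer 0
one_hot = [[1.0, 0.0]]         # majority-vote target
human = [[0.5, 0.5]]           # annotators split evenly

t_accuracy = fit_temperature(logits, one_hot)  # sharpens the prediction
t_human = fit_temperature(logits, human)       # flattens the prediction
```

In this toy case, fitting toward the one-hot target selects the smallest temperature in the grid, which makes the prediction more peaked and moves it further from the evenly split human responses, while fitting toward the human distribution selects the largest temperature.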
Abstract
Large vision-language models frequently struggle to accurately predict responses provided by multiple human annotators, particularly when those responses exhibit human uncertainty. In this study, we focus on the Visual Question Answering (VQA) task, and we comprehensively evaluate how well the state-of-the-art vision-language models correlate with the distribution of human responses. To do so, we categorize our samples based on their levels (low, medium, high) of human uncertainty in disagreement (HUD) and employ not only accuracy but also three new human-correlated metrics in VQA, to investigate the impact of HUD. To better align models with humans, we also verify the effect of common calibration and human calibration. Our results show that even BEiT3, currently the best model for this task, struggles to capture the multi-label distribution inherent in diverse human responses.…
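As a rough illustration of the evaluation setup described in the abstract, the sketch below groups samples by the level of disagreement among annotators and compares a model's predicted distribution with the human one. The abstract does not name the three metrics or say how the low, medium, and high levels are defined, so everything here is an assumption: Shannon entropy stands in for the uncertainty measure, the thresholds are arbitrary, and total variation distance stands in for a human-correlated metric.

```python
import math
from collections import Counter

def human_distribution(answers):
    """Turn a list of annotator answers into a probability distribution."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {answer: count / total for answer, count in counts.items()}

def entropy(dist):
    """Shannon entropy in bits; 0 when all annotators agree."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

def hud_level(answers, low_max=0.5, high_min=1.5):
    """Bucket a sample by answer entropy. Thresholds are illustrative assumptions."""
    h = entropy(human_distribution(answers))
    if h <= low_max:
        return "low"
    if h >= high_min:
        return "high"
    return "medium"

def total_variation(p, q):
    """Total variation distance between two answer distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical annotator answers for three questions.
agree = ["yes"] * 10
split = ["yes"] * 5 + ["no"] * 5
diverse = ["red"] * 3 + ["maroon"] * 3 + ["crimson"] * 2 + ["pink"] * 2

levels = [hud_level(a) for a in (agree, split, diverse)]

# A hypothetical overconfident model prediction on the evenly split question.
model_pred = {"yes": 0.9, "no": 0.1}
distance = total_variation(human_distribution(split), model_pred)
```

Here the model's top answer counts as correct under accuracy, yet its distribution is still far from the human one, which is the kind of gap a distribution-level metric is meant to expose.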
Taxonomy
Topics: Risk and Safety Analysis · Bayesian Modeling and Causal Inference · Explainable Artificial Intelligence (XAI)
