Variational Visual Question Answering for Uncertainty-Aware Selective Prediction

Tobias Jan Wieczorek; Nathalie Daun; Mohammad Emtiyaz Khan; Marcus Rohrbach

arXiv:2505.09591·cs.CV·April 23, 2026

TL;DR

This paper introduces Variational VQA, a Bayesian approach that enhances the reliability and safety of large vision-language models in visual question answering by improving calibration and selective prediction.

Contribution

The paper demonstrates that variational Bayes is effective for selective prediction in VQA, where it outperforms traditional methods, and it proposes a new risk-averse selector for better uncertainty handling.

Findings

01 Variational VQA improves calibration and selective prediction accuracy.

02 A single posterior sample can outperform models trained with AdamW.

03 The risk-averse selector outperforms standard sample averaging.

Abstract

Despite remarkable progress in recent years, Vision Language Models (VLMs) remain prone to overconfidence and hallucinations on tasks such as Visual Question Answering (VQA) and Visual Reasoning. Bayesian methods can potentially improve reliability by helping models predict selectively, that is, models respond only when they are sufficiently confident. Unfortunately, such approaches can be costly and ineffective for large models, and there exists little evidence to show otherwise for multimodal applications. Here, we show for the first time the effectiveness and competitive edge of variational Bayes for selective prediction in VQA. We build on recent advances in variational methods for deep learning and propose an extension called "Variational VQA". This method improves calibration and yields significant gains for selective prediction on VQA and Visual Reasoning, particularly when the…
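
To make the idea of selective prediction concrete, here is a minimal sketch, not the authors' implementation. It assumes a model whose posterior can be sampled S times, with each sample yielding a probability distribution over candidate answers. The function name `selective_predict`, the `threshold` value, and the `risk_aversion` variance penalty are all hypothetical; in particular, the text above does not specify how the paper's risk-averse selector is defined, so the penalty term here is only one plausible illustration of being more cautious when samples disagree.

```python
from statistics import mean, pstdev


def selective_predict(sample_probs, threshold=0.5, risk_aversion=0.0):
    """Answer only when confident enough; otherwise abstain.

    sample_probs: list of S probability lists, one per posterior sample,
                  each of length A (number of candidate answers).
    threshold:    hypothetical minimum confidence required to answer.
    risk_aversion: hypothetical weight on the spread across samples;
                  0.0 reduces to plain sample averaging.
    Returns (answer_index or None, confidence).
    """
    num_answers = len(sample_probs[0])

    # Standard sample averaging: mean probability per answer over the samples.
    avg = [mean(s[a] for s in sample_probs) for a in range(num_answers)]
    answer = max(range(num_answers), key=lambda a: avg[a])

    # Assumed risk-averse variant: subtract a multiple of the standard
    # deviation of the chosen answer's probability across samples.
    spread = pstdev([s[answer] for s in sample_probs])
    confidence = avg[answer] - risk_aversion * spread

    if confidence < threshold:
        return None, confidence  # abstain
    return answer, confidence


# Two posterior samples that disagree about how likely answer 0 is.
samples = [[0.9, 0.1], [0.5, 0.5]]
averaged = selective_predict(samples, threshold=0.6)
cautious = selective_predict(samples, threshold=0.6, risk_aversion=1.0)
```

With these inputs, plain averaging clears the threshold and answers, while the variance-penalized variant abstains because the two samples disagree. The threshold would normally be tuned on held-out data to trade coverage against risk.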
