VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models

Chenyu Wang; Tianle Chen; H. M. Sabbir Ahmad; Kayhan Batmanghelich; Wenchao Li

arXiv:2602.09214 · cs.CV · February 11, 2026

TL;DR

This paper introduces VLM-UQBench, a comprehensive benchmark for evaluating modality-specific and cross-modal uncertainty in vision-language models, revealing current UQ methods' limitations in detecting subtle, instance-level ambiguities.

Contribution

The paper presents VLM-UQBench, a new benchmark with perturbation-based evaluation metrics for modality-aware uncertainty in VLMs, highlighting gaps in current UQ methods.

Findings

01. Existing UQ methods show modality-specific strengths and weaknesses.

02. UQ scores weakly correlate with hallucinations and often fail to detect subtle ambiguities.

03. UQ methods perform comparably to reasoning-based baselines on overt ambiguities but struggle with fine-grained uncertainty.

Abstract

Uncertainty quantification (UQ) is vital for ensuring that vision-language models (VLMs) behave safely and reliably. A central challenge is to localize uncertainty to its source, determining whether it arises from the image, the text, or misalignment between the two. We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in VLMs. It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations. We further propose two simple metrics that quantify the sensitivity of UQ scores to these perturbations and their correlation with hallucinations, and use them to evaluate a range of UQ methods across four VLMs and three datasets. Empirically, we find that: (i) existing UQ methods exhibit…
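The abstract names two metrics but this excerpt does not define them, so the following is only an illustrative sketch of what such quantities could look like, not the paper's formulation. Both function names and both definitions are assumptions: sensitivity is taken here as the share of paired samples whose UQ score rises after a perturbation, and hallucination correlation as the Pearson (point-biserial) correlation between UQ scores and binary hallucination labels.

```python
from math import sqrt
from statistics import mean


def perturbation_sensitivity(clean_scores, perturbed_scores):
    """Hypothetical metric (not from the paper): share of paired samples
    whose UQ score increases after a perturbation is applied."""
    if not clean_scores or len(clean_scores) != len(perturbed_scores):
        raise ValueError("expected two non-empty, equal-length score lists")
    rises = [1.0 if p > c else 0.0
             for c, p in zip(clean_scores, perturbed_scores)]
    return mean(rises)


def hallucination_correlation(uq_scores, hallucinated):
    """Hypothetical metric (not from the paper): Pearson correlation between
    UQ scores and binary hallucination labels (1 = hallucinated)."""
    if len(uq_scores) != len(hallucinated) or len(uq_scores) < 2:
        raise ValueError("expected two equal-length lists with at least 2 items")
    mx, my = mean(uq_scores), mean(hallucinated)
    cov = sum((x - mx) * (y - my) for x, y in zip(uq_scores, hallucinated))
    vx = sum((x - mx) ** 2 for x in uq_scores)
    vy = sum((y - my) ** 2 for y in hallucinated)
    if vx == 0 or vy == 0:
        # Pearson correlation is undefined when either input is constant.
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)
```

Under these assumed definitions, a UQ method that responds to a perturbation type would score near 1 on `perturbation_sensitivity` for that type, and a method whose scores track hallucinations would show a clearly positive `hallucination_correlation`.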

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Topic Modeling