Are vision language models robust to uncertain inputs?

Xi Wang; Eric Nalisnick

arXiv:2505.11804·cs.CV·May 20, 2025

Open Access

TL;DR

This paper evaluates how robust vision language models (VLMs) are to uncertain inputs. It finds that newer, larger models are more robust but still hallucinate confident answers on unclear or anomalous inputs, and it proposes a caption diversity method for estimating uncertainty.

Contribution

It provides an empirical assessment of VLM robustness, demonstrates the effectiveness of prompting for abstention, and introduces a novel caption diversity mechanism for uncertainty estimation.
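
As a rough illustration of what prompting for abstention can look like, the sketch below builds a classification prompt with an explicit opt-out and parses the reply. The prompt wording, the `unsure` token, and the helper names `build_prompt` and `parse_answer` are assumptions made for this example; they are not taken from the paper, and no particular VLM API is implied.

```python
from typing import Optional, Sequence

# Hypothetical abstention token; the paper's actual prompt wording is not given here.
ABSTAIN = "unsure"


def build_prompt(labels: Sequence[str], allow_abstain: bool = True) -> str:
    """Build a classification instruction, optionally offering an opt-out."""
    options = ", ".join(labels)
    prompt = f"Classify the image as one of: {options}."
    if allow_abstain:
        prompt += (
            f" If the image is unclear or matches none of these,"
            f" answer '{ABSTAIN}'."
        )
    return prompt


def parse_answer(reply: str, labels: Sequence[str]) -> Optional[str]:
    """Return the chosen label, or None if the model abstained or went off-list."""
    cleaned = reply.strip().strip(".'\"").lower()
    if cleaned == ABSTAIN:
        return None
    for label in labels:
        if cleaned == label.lower():
            return label
    return None
```

The prompt string would be sent to a VLM together with the image; a `None` result from `parse_answer` is then treated as an abstention rather than a prediction.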

Findings

01

Larger VLMs show improved robustness but still hallucinate on ambiguous inputs.

02

Prompting models to abstain significantly improves reliability on natural images.

03

Caption diversity can predict model success in abstaining without labeled data.
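
The summary above does not say how caption diversity is measured, so the following is only a sketch of the general idea: sample several captions for the same image and score how much they disagree. Mean pairwise Jaccard distance over word sets is used here as a stand-in measure, and the function names are hypothetical; the paper's actual mechanism may differ.

```python
from itertools import combinations
from typing import Sequence


def jaccard_distance(a: str, b: str) -> float:
    """1 minus the Jaccard similarity of the two captions' lowercase word sets."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 0.0
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def caption_diversity(captions: Sequence[str]) -> float:
    """Mean pairwise distance across sampled captions (assumed proxy for uncertainty)."""
    pairs = list(combinations(captions, 2))
    if not pairs:
        return 0.0  # fewer than two captions: no disagreement to measure
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

Under this assumption, captions that agree with each other give a score near 0 and captions that describe different things give a score near 1. No labels are needed, since the score depends only on the sampled captions.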

Abstract

Robustness against uncertain and ambiguous inputs is a critical challenge for deep learning models. While recent advancements in large scale vision language models (VLMs, e.g. GPT4o) might suggest that increasing model and training dataset size would mitigate this issue, our empirical evaluation shows a more complicated picture. Testing models using two classic uncertainty quantification tasks, anomaly detection and classification under inherently ambiguous conditions, we find that newer and larger VLMs indeed exhibit improved robustness compared to earlier models, but still suffer from a tendency to strictly follow instructions, often causing them to hallucinate confident responses even when faced with unclear or anomalous inputs. Remarkably, for natural images such as ImageNet, this limitation can be overcome without pipeline modifications: simply prompting models to abstain from…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)