Are vision language models robust to uncertain inputs?
Xi Wang, Eric Nalisnick

TL;DR
This paper evaluates the robustness of vision language models to uncertain inputs, revealing that larger models are more robust but still prone to hallucinations, and proposes a caption diversity method to better estimate uncertainty.
Contribution
It provides an empirical assessment of VLM robustness, demonstrates the effectiveness of prompting for abstention, and introduces a novel caption diversity mechanism for uncertainty estimation.
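To make the idea of prompting for abstention concrete, here is a minimal sketch of a classification prompt that offers the model an explicit way out. It is not the authors' prompt: the wording, the `ABSTAIN_OPTION` string, and the helper names `build_prompt` and `parse_answer` are all assumptions made for illustration, and no real VLM API is called.

```python
# Hypothetical sketch: a multiple-choice prompt with an explicit abstain option.
# The prompt wording and helper names are assumptions, not taken from the paper.

ABSTAIN_OPTION = "none of the above"


def build_prompt(class_names, allow_abstain=True):
    """Build a text prompt listing candidate labels, optionally with an abstain choice."""
    options = list(class_names)
    if allow_abstain:
        options.append(ABSTAIN_OPTION)
    lines = ["Which of the following best describes the image?"]
    lines += [f"- {name}" for name in options]
    lines.append("Answer with exactly one option from the list.")
    return "\n".join(lines)


def parse_answer(response, class_names):
    """Map a free-text response to a class name, or None for abstention.

    Responses that match no listed class are also treated as abstentions.
    Matching is by simple case-insensitive substring, which is crude but
    enough for a sketch.
    """
    text = response.strip().lower()
    if ABSTAIN_OPTION in text:
        return None
    for name in class_names:
        if name.lower() in text:
            return name
    return None
```

The returned prompt string would be sent to a model alongside the image. Comparing results with `allow_abstain=True` and `allow_abstain=False` is one simple way to probe whether a model hallucinates a label only because the instructions left it no alternative.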
Findings
Larger VLMs show improved robustness but still hallucinate on ambiguous inputs.
Prompting models to abstain significantly improves reliability on natural images.
Caption diversity can predict model success in abstaining without labeled data.
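The caption diversity idea can be sketched as follows: sample several captions for the same image, then score how much they disagree. The summary does not say how the paper measures diversity, so the metric below (mean pairwise Jaccard distance over word sets) is an assumption chosen for simplicity, and the function names are hypothetical.

```python
# Hypothetical sketch: caption diversity as mean pairwise Jaccard distance.
# The metric is an illustrative assumption, not the paper's definition.
from itertools import combinations


def token_set(caption):
    """Lowercase a caption and split it into a set of whitespace-delimited words."""
    return frozenset(caption.lower().split())


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def caption_diversity(captions):
    """Average Jaccard distance over all pairs of captions.

    0.0 means every caption uses the same words; 1.0 means no two captions
    share a word. Fewer than two captions yields 0.0.
    """
    sets = [token_set(c) for c in captions]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

Under this sketch, a high score means the model describes the same image inconsistently, which can be read as a label-free signal of uncertainty. An embedding-based similarity could replace the word-overlap measure without changing the overall structure.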
Abstract
Robustness against uncertain and ambiguous inputs is a critical challenge for deep learning models. While recent advancements in large-scale vision language models (VLMs, e.g., GPT-4o) might suggest that increasing model and training dataset size would mitigate this issue, our empirical evaluation shows a more complicated picture. Testing models using two classic uncertainty quantification tasks, anomaly detection and classification under inherently ambiguous conditions, we find that newer and larger VLMs indeed exhibit improved robustness compared to earlier models, but still suffer from a tendency to strictly follow instructions, often causing them to hallucinate confident responses even when faced with unclear or anomalous inputs. Remarkably, for natural images such as ImageNet, this limitation can be overcome without pipeline modifications: simply prompting models to abstain from…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
