ViLU: Learning Vision-Language Uncertainties for Failure Prediction

Marc Lafon; Yannis Karmim; Julio Silva-Rodríguez; Paul Couairon; Clément Rambour; Raphaël Fournier-Sniehotta; Ismail Ben Ayed; Jose Dolz; Nicolas Thome

arXiv:2507.07620 · cs.CV · September 22, 2025

TL;DR

ViLU introduces a novel framework for uncertainty quantification in vision-language models, leveraging multi-modal representations and a loss-agnostic predictor to improve failure prediction across diverse datasets.

Contribution

The paper presents ViLU, a new post-hoc uncertainty quantification method that integrates visual and textual features for better failure prediction in vision-language models.

Findings

1. Significant improvements over state-of-the-art failure prediction methods.

2. Effective on both classification and large-scale caption datasets.

3. Ablation studies confirm architecture and training effectiveness.

Abstract

Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model…
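The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the single-head attention without learned projections, the plain concatenation of the three embeddings, the `pos_weight` weighting scheme, and all function names are assumptions made for illustration only.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def image_conditioned_text(visual, text_embeds, temperature=1.0):
    """Cross-attention with the visual embedding as the query and the
    candidate text embeddings as keys and values.
    Assumption: one head, no learned projection matrices."""
    weights = softmax([dot(visual, t) / temperature for t in text_embeds])
    dim = len(visual)
    return [sum(w * t[i] for w, t in zip(weights, text_embeds)) for i in range(dim)]


def build_features(visual, text_embeds):
    """Concatenate the visual embedding, the predicted text embedding,
    and the image-conditioned text representation.
    Assumption: plain concatenation; the predicted class is the argmax
    of the image-text similarity."""
    predicted = max(range(len(text_embeds)), key=lambda k: dot(visual, text_embeds[k]))
    attended = image_conditioned_text(visual, text_embeds)
    return visual + text_embeds[predicted] + attended, predicted


def weighted_bce(probs, is_error, pos_weight):
    """Weighted binary cross-entropy, averaged over samples.
    probs[i] is the predicted probability that sample i is misclassified;
    is_error[i] is 1 if the VLM prediction was wrong, else 0.
    Assumption: only the error class is up-weighted, by pos_weight."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, is_error):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


# Usage with toy 2-dimensional embeddings (hypothetical values).
visual = [1.0, 0.0]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
features, predicted = build_features(visual, text_embeds)
loss = weighted_bce([0.5], [1], pos_weight=2.0)
```

In this sketch the binary target depends only on whether the prediction was correct, not on the value of any task loss, which is one way to read the loss-agnostic property the abstract mentions. A learned classifier (for example a small MLP) would map `features` to the error probability passed to `weighted_bce`.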

Taxonomy

Topics: Infrastructure Maintenance and Monitoring · Anomaly Detection Techniques and Applications · Natural Language Processing Techniques