To Trust Or Not To Trust Your Vision-Language Model's Prediction
Hao Dong, Moru Liu, Jian Liang, Eleni Chatzi, Olga Fink

TL;DR
TrustVLM is a training-free framework that enhances the reliability of vision-language models by estimating prediction trustworthiness, significantly reducing misclassification risks in safety-critical applications without retraining.
Contribution
The paper introduces TrustVLM, a confidence-scoring method that leverages the image embedding space to detect misclassifications, improving the trustworthiness of VLM predictions without additional training.
Findings
Achieved up to 51.87% improvement in AURC
Demonstrated state-of-the-art detection performance across datasets
Validated effectiveness on multiple architectures and VLMs
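For readers unfamiliar with the metric, AURC (area under the risk-coverage curve) can be sketched as follows. This is a minimal illustration of the standard definition, not the authors' evaluation code: samples are ranked by confidence, and the error rate among the retained samples is averaged over all coverage levels, so lower is better. The helper `relative_improvement` is hypothetical, and it is an assumption that the reported percentage is a relative reduction of this kind.

```python
def aurc(confidences, correct):
    """Area under the risk-coverage curve; lower means better failure detection.

    confidences: one confidence score per sample (higher = more trusted).
    correct:     one boolean per sample, True if the prediction was right.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Rank samples from most to least confident.
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)

    errors = 0
    risk_sum = 0.0
    for kept, idx in enumerate(order, start=1):
        if not correct[idx]:
            errors += 1
        # Risk at this coverage level: error rate among the `kept` top samples.
        risk_sum += errors / kept
    return risk_sum / len(order)


def relative_improvement(baseline_aurc, new_aurc):
    """Hypothetical helper: percentage reduction in AURC over a baseline."""
    return 100.0 * (baseline_aurc - new_aurc) / baseline_aurc
```

A confidence score that ranks wrong predictions below correct ones yields a lower AURC than one that ranks them above, even when the underlying accuracy is identical.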
Abstract
Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when a VLM's predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification…
Peer Reviews
Decision: Submitted to ICLR 2026
1. TrustVLM is a training-free framework designed to evaluate the reliability of VLM predictions. One of its key advantages is that it does not require additional training, which makes it convenient to apply in scenarios where labeled data is limited or unavailable. The framework combines both image-to-text and image-to-image similarities, which allows for a more robust and nuanced design of confidence scores. This combination provides a richer representation of the visual information, enabling
1. A notable limitation of this method is that it relies on the availability of in-domain data that includes images for all classes to be predicted. Under this assumption, the method can extract and store visual prototypes for each class, which are then used for confidence estimation. However, in many practical scenarios, obtaining such in-domain data for every class may be difficult or infeasible. Moreover, if the training or reference data does not fully cover the diversity of the test data, t
- This work addresses an important task: determining when the predictions of a VLM are likely to be reliable.
- Although the proposed method is methodologically straightforward, strong performance is observed across a range of datasets and model backbones. The authors also compare with multiple baselines. The distribution shift experiments with ImageNet are particularly compelling.
- **Need for finer-grained analysis:** This paper could benefit from additional fine-grained analysis with respect to when the proposed method is most effective (rather than just overall metrics). For example, are there specific classes where misclassification detection performance improves substantially when using the proposed method (as compared to MSP)? What types of characteristics are common among those classes?
- **Variance of performance:** The proposed method is likely very sensitive to
- This paper is easy to follow, the motivation is very clear, and the intuition is quite straightforward.
- The proposed TrustVLM is training-free and efficient to deploy. It can also be easily adopted by any VLM architecture.
- The experimental performance is quite promising.
- The major concern is the missing comparison with unimodal detection methods. The proposed method combines multimodal information to detect prediction errors; however, in the ablation study, there is no comparison with image-only or text-only detection. Such a comparison would make it clearer which modality branch contributes more to the overall performance improvement.
- Moreover, the performance of TrustVLM highly relies on the performance of the employed VLMs; if the VLMs cannot provide hig
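The reviews above describe the method as combining image-to-text and image-to-image similarities, with per-class visual prototypes extracted from in-domain reference images. The sketch below illustrates that general idea only; it is not the authors' scoring function. The function names, the plain sum of the two terms, the use of a single shared embedding space, the default `temperature`, and the convention that class labels are integer indices into `text_embs` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_prototypes(ref_embeddings, ref_labels):
    """Mean reference-image embedding per class (the stored visual prototypes)."""
    sums, counts = {}, {}
    for emb, label in zip(ref_embeddings, ref_labels):
        acc = sums.setdefault(label, [0.0] * len(emb))
        for d, value in enumerate(emb):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}


def trust_score(image_emb, text_embs, prototypes, temperature=0.01):
    """Return (predicted class index, illustrative confidence score)."""
    # Image-to-text branch: zero-shot prediction and its softmax probability.
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    probs = softmax(logits)
    pred = max(range(len(probs)), key=probs.__getitem__)
    image_to_text = probs[pred]

    # Image-to-image branch: similarity to the predicted class's prototype.
    image_to_image = cosine(image_emb, prototypes[pred])

    # Assumed combination rule: an unweighted sum of the two terms.
    return pred, image_to_text + image_to_image
```

In this toy setup, an image that lies between two classes can still receive a near-maximal softmax probability because of the low temperature, while its lower similarity to the predicted class's prototype pulls the combined score down. The sketch also makes the reviewers' concern concrete: `prototypes[pred]` only exists if reference images are available for every class.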
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Ethics and Social Impacts of AI
