Post-hoc Probabilistic Vision-Language Models

Anton Baumann; Rui Li; Marcus Klasson; Santeri Mentu; Shyamgopal Karthik; Zeynep Akata; Arno Solin; Martin Trapp

arXiv:2412.06014 · cs.CV · February 16, 2026

TL;DR

This paper introduces a post-hoc Bayesian method that quantifies uncertainties in vision-language models without retraining, with the aim of improving reliability in safety-critical tasks and sample efficiency in active learning.

Contribution

The paper proposes a post-hoc Bayesian approach to uncertainty estimation in VLMs that improves calibration and interpretability without additional training.

Findings

1. Improved uncertainty calibration over baselines
2. Enhanced sample efficiency in active learning
3. Potential for safer deployment in critical applications
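
Calibration claims such as the first finding are commonly checked with the expected calibration error (ECE). The sketch below is an illustrative implementation of the standard binned ECE, offered under assumptions: this page does not state which calibration metric or binning the paper uses, and the function name `expected_calibration_error` and the `n_bins` default are hypothetical.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen class, values in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    NOTE: illustrative helper, not taken from the paper or its repository.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Equal-width bins over [0, 1]; interior edges define right-closed bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    bin_ids = np.clip(bin_ids, 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between empirical accuracy and average confidence in this bin.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            # Weight the gap by the fraction of samples that fall in the bin.
            ece += mask.mean() * gap
    return float(ece)
```

A lower value means that stated confidences track empirical accuracy more closely; a model that is always fully confident and always right scores zero.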

Abstract

Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties,…
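
To make the idea of uncertainty over cosine similarities concrete, here is a minimal sketch that is not the authors' method. It assumes, as the reviews on this page describe, Gaussian-distributed image and text embeddings that are independent across modalities, and it additionally assumes diagonal covariances. It estimates the mean and variance of the cosine similarity by Monte Carlo sampling, whereas the paper states that it quantifies these uncertainties analytically. The function name `cosine_similarity_moments` and all of its parameters are hypothetical.

```python
import numpy as np


def cosine_similarity_moments(mu_img, var_img, mu_txt, var_txt,
                              n_samples=4000, seed=0):
    """Monte Carlo mean and variance of the cosine similarity.

    mu_*: embedding means, shape (d,).
    var_*: per-dimension variances, shape (d,) (diagonal covariance assumed).
    NOTE: sampling stand-in for illustration; not the paper's analytic result.
    """
    mu_img = np.asarray(mu_img, dtype=float)
    mu_txt = np.asarray(mu_txt, dtype=float)
    std_img = np.sqrt(np.asarray(var_img, dtype=float))
    std_txt = np.sqrt(np.asarray(var_txt, dtype=float))

    rng = np.random.default_rng(seed)
    # Draw the two modalities independently (the independence assumption).
    g = mu_img + std_img * rng.standard_normal((n_samples, mu_img.size))
    h = mu_txt + std_txt * rng.standard_normal((n_samples, mu_txt.size))

    # Cosine similarity of each sampled pair of embeddings.
    dots = np.sum(g * h, axis=1)
    norms = np.linalg.norm(g, axis=1) * np.linalg.norm(h, axis=1)
    cos = dots / norms
    return float(cos.mean()), float(cos.var())


if __name__ == "__main__":
    mu_a = np.array([1.0, 0.0, 0.0])
    mu_b = np.array([1.0, 1.0, 0.0])
    mean, var = cosine_similarity_moments(mu_a, np.full(3, 0.05),
                                          mu_b, np.full(3, 0.05))
    print(f"mean cosine similarity: {mean:.3f}, variance: {var:.4f}")
```

With zero variances the estimate collapses to the usual deterministic cosine similarity, which is the behavior of a standard VLM; nonzero variances spread the similarity into a distribution whose width can serve as an uncertainty signal.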

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The mathematical process is solid.
2. This is a post-hoc method, which does not require any retraining, fine-tuning, or modifications to the VLM architecture. It only needs a Laplace approximation to quantify uncertainty. It is easy to apply.

Weaknesses

1. The motivation for measuring the uncertainty of VLMs (CLIP) is not compelling. The significance of the topic needs to be emphasized. Maybe it is for effectively selecting training data in active learning. If BayesVLM had more application fields, it would be better.
2. The method relies on two assumptions that may not always hold: (a) the image and text embeddings can be modeled with Gaussian distributions, and (b) the two modalities are independent. These are simplifications.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The proposed method is training-free and works with off-the-shelf CLIP-like models.
2. The proposed method provides overall better calibration performance on the evaluated benchmarks with small computation overhead.
3. Proxy-data robustness: Hessians from CC12M still work decently for CLIP trained on LAION-400M.

Weaknesses

1. Treating image and textual modalities as independent is the core approximation. While the authors justify it via local post-hoc around MAP, it remains a potential mismatch for strongly coupled modalities; discussion is present but could use a stronger empirical illustration.
2. The proposed method puts all epistemic uncertainty in the final projections. This may under-estimate uncertainty on heavy distribution shifts.
3. For closed-source models, the approach needs proxy data; results are p

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- **Novelty and conceptual clarity:** The idea of using a post-hoc Laplace approximation on pre-trained VLMs to obtain Bayesian uncertainty estimates is novel and conceptually elegant. The formulation is well-grounded in Bayesian deep learning literature and bridges it with the practical need for scalable uncertainty estimation in large multimodal models.
- **Practicality and scalability:** The method is training-free and model-agnostic, requiring only access to the final projection weights and no

Weaknesses

- **Assumption of Independence (Modality Factorization):** The method assumes independence between image and text modalities (P ⊥⊥ Q) to enable tractable posterior estimation. Although the authors justify this as a local approximation around the MAP estimate, it weakens the theoretical rigor since VLMs are inherently cross-modal. Also, the impact of this assumption on downstream uncertainty fidelity is not fully explored. Can the authors comment on this?
- **Limited Scope of Bayesian treatment:**

Code & Models

Repositories

AaltoML/BayesVLM
PyTorch · Official

Models

Taxonomy

Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies · Advanced Image and Video Retrieval Techniques

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training