GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

Mayank Nautiyal; Li Ju; Andreas Hellander; Ekta Vats; Prashant Singh

arXiv:2605.13352 · cs.LG · May 14, 2026

TL;DR

GeoFlowVLM introduces a geometry-aware post-hoc method for vision-language models that jointly estimates aleatoric and epistemic uncertainties on the hypersphere, improving interpretability and calibration.

Contribution

The paper proposes a Riemannian flow matching approach that models the joint distribution of paired image and text embeddings, and derives uncertainty metrics with theoretical justification.

Findings

01

Entropy correlates with Recall@1 and is well-calibrated across benchmarks.

02

Marginal-typicality score yields calibrated zero-shot classification accuracy.

03

In the population limit, the trained network exposes valid joint and cross-modal conditional flows on the product hypersphere.

Abstract

Standard dual-encoder vision-language models that map images and text to deterministic points on a shared unit hypersphere through $\ell_2$ normalization typically expose neither aleatoric uncertainty (cross-modal ambiguity) nor epistemic uncertainty (lack of training-distribution support). Existing post-hoc methods either recover at most one of the two uncertainty components or ignore the hyperspherical geometry of these models' embeddings. We propose GeoFlowVLM as a post-hoc adapter that learns the joint distribution of paired $\ell_2$-normalised dual-encoder VLM embeddings on the product hypersphere $S^{d-1} \times S^{d-1}$ via Riemannian flow matching with a single masked velocity field. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows as valid…
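For readers unfamiliar with flow matching on a sphere, the sketch below illustrates the generic geodesic construction that Riemannian flow matching typically uses on $S^{d-1}$: a point moves along the great circle from a source sample to a target embedding, and the regression target is the tangent velocity along that path. It is not the authors' implementation. The function names, the NumPy setting, and the choice of geodesic interpolation are assumptions made for illustration, and the paper's masked velocity field is not reproduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project vectors onto the unit hypersphere S^{d-1}."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_map(x, y, eps=1e-7):
    """Tangent vector at x pointing toward y; its length is the geodesic distance."""
    cos = np.clip(np.sum(x * y, axis=-1, keepdims=True), -1.0, 1.0)
    theta = np.arccos(cos)
    u = y - cos * x  # component of y orthogonal to x
    return theta * u / np.maximum(np.linalg.norm(u, axis=-1, keepdims=True), eps)

def exp_map(x, v, eps=1e-7):
    """Follow the geodesic from x in tangent direction v for length ||v||."""
    norm_v = np.linalg.norm(v, axis=-1, keepdims=True)
    direction = v / np.maximum(norm_v, eps)
    return np.cos(norm_v) * x + np.sin(norm_v) * direction

def geodesic_point_and_velocity(x0, x1, t):
    """Point x_t on the geodesic from x0 to x1 and its tangent velocity, for 0 <= t < 1."""
    xt = exp_map(x0, t * log_map(x0, x1))
    ut = log_map(xt, x1) / (1.0 - t)  # constant-speed velocity, tangent at x_t
    return xt, ut

# Hypothetical stand-ins for paired image/text embeddings and noise samples.
rng = np.random.default_rng(0)
img, txt = (l2_normalize(rng.normal(size=(4, 8))) for _ in range(2))
src_img, src_txt = (l2_normalize(rng.normal(size=(4, 8))) for _ in range(2))

# On the product hypersphere, the construction applies to each factor separately.
xt_img, ut_img = geodesic_point_and_velocity(src_img, img, 0.3)
xt_txt, ut_txt = geodesic_point_and_velocity(src_txt, txt, 0.3)
```

In a training loop, a network would be regressed onto `ut_img` and `ut_txt` at the interpolated points. Working with the log and exp maps keeps every intermediate point on the sphere and every velocity in the tangent space, which is the property a Euclidean interpolation followed by renormalization does not guarantee.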
