GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding
Mayank Nautiyal, Li Ju, Andreas Hellander, Ekta Vats, Prashant Singh

TL;DR
GeoFlowVLM introduces a geometry-aware post-hoc method for vision-language models that jointly estimates aleatoric and epistemic uncertainties on the hypersphere, improving interpretability and calibration.
Contribution
It proposes a novel Riemannian flow matching approach to model the joint distribution of paired image and text embeddings and derives uncertainty metrics with theoretical justification.
Findings
Entropy correlates with Recall@1 and is well-calibrated across benchmarks.
Marginal-typicality score yields calibrated zero-shot classification accuracy.
Model exposes valid joint and conditional flows on the hypersphere.
Abstract
Standard dual-encoder vision-language models that map images and text to deterministic points on a shared unit hypersphere through normalization typically expose neither \emph{aleatoric} uncertainty (cross-modal ambiguity) nor \emph{epistemic} uncertainty (lack of training-distribution support). Existing post-hoc methods either recover at most one of the two uncertainty components, or ignore the hyperspherical geometry of these models' embeddings. We propose \textbf{GeoFlowVLM} as a post-hoc adapter that learns the joint distribution of paired $\ell_2$-normalised dual-encoder VLM embeddings on the product hypersphere via Riemannian flow matching with a single masked velocity field. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows as valid…
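To make the phrase "Riemannian flow matching on the hypersphere" concrete, the following is a minimal sketch of the standard great-circle (geodesic) conditional path and its target velocity on a single unit sphere. It is not the authors' implementation: the function name `geodesic_point_and_velocity`, the toy 3-dimensional vectors, and the handling of degenerate endpoints are all assumptions made for illustration, and the masked velocity field and the product of two spheres described in the abstract are not modelled here. In a flow matching setup of this kind, a network would typically be trained to regress onto a target velocity such as `u_t`.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere (L2 normalisation)."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def geodesic_point_and_velocity(x0, x1, t):
    """Return (x_t, u_t) on the great-circle path from x0 to x1 at time t in [0, 1].

    x_t is the spherical linear interpolation of the two unit vectors, and
    u_t is its time derivative, which is tangent to the sphere at x_t.
    """
    cos_theta = max(-1.0, min(1.0, dot(x0, x1)))
    theta = math.acos(cos_theta)  # geodesic distance between x0 and x1
    s = math.sin(theta)
    if s < 1e-8:
        # Assumption for this sketch: treat (near-)identical or antipodal
        # endpoints as degenerate and return a zero velocity.
        return list(x0), [0.0] * len(x0)
    a = math.sin((1.0 - t) * theta) / s
    b = math.sin(t * theta) / s
    da = -theta * math.cos((1.0 - t) * theta) / s  # d/dt of a
    db = theta * math.cos(t * theta) / s           # d/dt of b
    x_t = [a * p + b * q for p, q in zip(x0, x1)]
    u_t = [da * p + db * q for p, q in zip(x0, x1)]
    return x_t, u_t

# Toy example: x0 plays the role of a source sample, x1 of a data embedding.
x0 = normalize([1.0, 2.0, -0.5])
x1 = normalize([-0.3, 0.4, 1.2])
x_t, u_t = geodesic_point_and_velocity(x0, x1, 0.3)
```

Two properties distinguish this from a Euclidean straight-line path: `x_t` stays on the unit sphere for every `t`, and `u_t` is orthogonal to `x_t`, so following the velocity never leaves the manifold. For a product hypersphere, the same construction could be applied to each factor separately.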
