On the Spectral Geometry of Cross-Modal Representations: A Functional Map Diagnostic for Multimodal Alignment
Krisanu Sarkar

TL;DR
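This paper analyzes the spectral geometry of representations from independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders. It finds that the two manifolds have similar intrinsic complexity (closely matching Laplacian eigenvalue spectra) but structurally decoupled eigenvector bases, and it proposes diagnostic metrics for alignment.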
Contribution
It introduces a spectral geometry framework to diagnose cross-modal alignment issues and uncovers a spectral complexity-orientation gap between independently trained models.
Findings
Laplacian eigenvalue spectra are similar across models (normalized spectral distance 0.043).
Eigenvector bases are effectively unaligned: the functional map has near-zero diagonal dominance (mean below 0.05) and a large orthogonality error (70.15).
The spectral complexity-orientation gap constrains spectral alignment methods.
Abstract
We study cross-modal alignment between independently pretrained vision (DINOv2) and language (all-MiniLM-L6-v2) encoders using the functional map framework from computational geometry, which represents correspondence between representation manifolds as a compact linear operator between graph Laplacian eigenbases. While the framework underperforms Procrustes alignment and relative representations for cross-modal retrieval across all supervision budgets, it reveals a structural property of multimodal representations. We find that the Laplacian eigenvalue spectra of the two encoders are quantitatively similar (normalized spectral distance 0.043), indicating that independently trained models develop manifolds of comparable intrinsic complexity. However, the functional map exhibits near-zero diagonal dominance (mean below 0.05) and large orthogonality error (70.15), showing that the…
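The diagnostic pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. The metric formulas below (relative eigenvalue distance, row-wise diagonal dominance, Frobenius orthogonality error) are assumed definitions, since this summary does not state them. The function names, the k-NN graph construction, and the parameters `k` and `m` are hypothetical, and the sketch assumes every sample is paired across modalities, so the functional map reduces to a projection of one eigenbasis onto the other.

```python
import numpy as np

def knn_laplacian(X, k=10):
    """Symmetric normalized Laplacian of a symmetrized binary k-NN graph."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distances
    np.fill_diagonal(d2, np.inf)                      # exclude self-neighbors
    idx = np.argsort(d2, axis=1)[:, :k]               # k nearest per row
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                            # symmetrize
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))         # degree >= k > 0
    return np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def eigenbasis(L, m):
    """First m eigenvalues and eigenvectors (ascending eigenvalue order)."""
    evals, evecs = np.linalg.eigh(L)
    return evals[:m], evecs[:, :m]

def functional_map(Phi_x, Phi_y):
    """Map from X-basis coefficients to Y-basis coefficients.

    Assumes row i of both bases refers to the same paired sample.
    """
    return Phi_y.T @ Phi_x

def diagnose(X, Y, k=10, m=20):
    """Compute the three diagnostics under the assumed definitions."""
    lam_x, Phi_x = eigenbasis(knn_laplacian(X, k), m)
    lam_y, Phi_y = eigenbasis(knn_laplacian(Y, k), m)
    C = functional_map(Phi_x, Phi_y)
    absC = np.abs(C)
    return {
        # Assumed: relative L2 distance between truncated eigenvalue spectra.
        "spectral_distance": np.linalg.norm(lam_x - lam_y)
        / max(np.linalg.norm(lam_x), np.linalg.norm(lam_y)),
        # Assumed: mean share of each row's absolute mass on the diagonal.
        "diagonal_dominance": float(np.mean(np.diag(absC) / absC.sum(axis=1))),
        # Assumed: Frobenius deviation of C from an orthogonal matrix.
        "orthogonality_error": float(np.linalg.norm(C.T @ C - np.eye(m))),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for paired image and text embeddings.
    vision = rng.normal(size=(200, 32))
    text = rng.normal(size=(200, 16))
    print(diagnose(vision, text, k=10, m=20))
```

Under these definitions, a map between two identical embeddings is the identity, giving diagonal dominance 1 and orthogonality error 0. Values near the ones reported above (dominance below 0.05 with closely matching eigenvalues) would correspond to spectra that agree while the eigenvectors point in unrelated directions.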
