Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale
A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros

TL;DR
This paper critically examines the claim that neural networks trained on different modalities converge to similar representations, finding that such alignment is fragile, dataset-dependent, and weaker than previously thought.
Contribution
The study challenges the robustness of cross-modal representational convergence claims, highlighting the importance of evaluation regimes and dataset scale.
Findings
Alignment measured on small datasets (1K samples) degrades substantially when the dataset is scaled to millions of samples.
Remaining alignment reflects coarse semantic overlap.
Newer models do not show increased alignment.
Abstract
The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment measured using mutual nearest neighbors on small datasets (1K samples) degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We…
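To make the mutual-nearest-neighbor measurement mentioned in the abstract concrete, here is a minimal sketch of one common way to compute such a score. It is not the authors' code: the function names `knn_indices` and `mutual_knn_alignment`, the use of cosine similarity, and the default `k=10` are all assumptions made for illustration. It assumes row `i` of both feature matrices describes the same paired sample (for example, an image and its caption), as in the one-to-one setting the abstract describes.

```python
import numpy as np


def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k nearest neighbors of each row (cosine similarity)."""
    # L2-normalize rows so that the dot product equals cosine similarity.
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T
    # Exclude each sample from its own neighbor list.
    np.fill_diagonal(sim, -np.inf)
    # Sort by descending similarity and keep the top k columns.
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_knn_alignment(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 10) -> float:
    """Average fraction of shared neighbors between two representation spaces.

    Hypothetical helper: row i of feats_a and row i of feats_b are assumed
    to be embeddings of the same underlying sample from two different models.
    """
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    # For each sample, count how many of its k neighbors agree across the two spaces.
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_feats = rng.normal(size=(200, 16))  # stand-in for vision-model embeddings
    text_feats = rng.normal(size=(200, 16))   # stand-in for language-model embeddings
    print(mutual_knn_alignment(image_feats, image_feats))  # identical spaces
    print(mutual_knn_alignment(image_feats, text_feats))   # unrelated spaces
```

A score of 1 means every sample has the same k neighbors in both spaces, and a score near 0 means the neighborhoods are unrelated. Because the full similarity matrix is materialized, this sketch only suits small sample counts; evaluating at the scale of millions of samples would need an approximate or batched neighbor search.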
