Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar, Alexei A. Efros

arXiv:2604.18572 · cs.CV · April 21, 2026

TL;DR

This paper critically examines the claim that neural networks trained on different modalities converge to similar representations, finding that such alignment is fragile, dataset-dependent, and weaker than previously thought.

Contribution

The study challenges the robustness of cross-modal representational convergence claims, highlighting the importance of evaluation regimes and dataset scale.

Findings

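01. Alignment degrades substantially as the evaluation set is scaled from ≈1K to millions of samples.

02. The alignment that remains reflects coarse semantic overlap rather than consistent fine-grained structure.

03. Newer models do not show increased alignment.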

Abstract

The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. Alignment is measured using mutual nearest neighbors on small datasets (≈1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We…
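The abstract's central measurement is mutual nearest-neighbour alignment. As a rough illustration, the sketch below computes a mutual k-NN score in the spirit of Huh et al.: for each of N paired samples (image i paired with caption i), it finds the sample's k nearest neighbours in each model's embedding space and averages the fraction of neighbours the two sets share. The function names, the use of cosine similarity and the default k=10 are assumptions chosen for illustration; they are not taken from the paper or from the akoepke/cave_umwelten repository.

```python
import numpy as np


def knn_indices(feats: np.ndarray, k: int) -> np.ndarray:
    """Indices of each row's k nearest neighbours by cosine similarity, excluding itself.

    Hypothetical helper for illustration; not from the paper's code.
    """
    # L2-normalise rows so that the dot product equals cosine similarity
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = normed @ normed.T  # dense N x N matrix: O(N^2) memory
    # A sample must not count as its own neighbour
    np.fill_diagonal(sim, -np.inf)
    # Sort by descending similarity and keep the top-k columns
    return np.argsort(-sim, axis=1)[:, :k]


def mutual_knn_alignment(feats_a: np.ndarray, feats_b: np.ndarray, k: int = 10) -> float:
    """Mean fraction of shared k-NN between two representations of the same N samples.

    Assumes a strict one-to-one pairing: row i of feats_a and row i of feats_b
    describe the same item (e.g. an image and its single caption).
    """
    assert feats_a.shape[0] == feats_b.shape[0], "rows must be paired one-to-one"
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    # Per-sample overlap |S_a(i) ∩ S_b(i)| / k, then average over all samples
    overlaps = [len(set(row_a) & set(row_b)) / k for row_a, row_b in zip(nn_a, nn_b)]
    return float(np.mean(overlaps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_feats = rng.normal(size=(1000, 64))  # stand-in for image embeddings
    text_feats = rng.normal(size=(1000, 48))   # stand-in for unrelated text embeddings
    print(mutual_knn_alignment(image_feats, image_feats))  # identical representations -> 1.0
    print(mutual_knn_alignment(image_feats, text_feats))   # independent features: close to chance, roughly k / (N - 1)
```

The dense N×N similarity matrix is cheap at ≈1K samples but grows quadratically, so scoring millions of samples in practice calls for batched or approximate nearest-neighbour search. The sketch also hard-codes the one-to-one row pairing that, per the abstract, breaks down in many-to-many image-caption data.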

Code & Models

Repositories

akoepke/cave_umwelten (GitHub)

Datasets

askoepke/wit_1m_recaptioned (dataset, 314 downloads)
