Provable Accuracy Collapse in Embedding-Based Representations under Dimensionality Mismatch
Dionysis Arvanitakis, Vaggos Chatziafratis, Yiyuan Luo

TL;DR
This paper proves that embedding-based representations in machine learning face a sharp accuracy decline if the embedding dimension is significantly lower than the true data dimension, even under standard contrastive learning scenarios.
Contribution
It establishes fundamental information-theoretic limits and computational hardness results for low-dimensional embeddings in contrastive learning.
Findings
Accuracy collapses when embedding dimension is below a constant fraction of the ground-truth dimension.
Every embedding whose dimension is too low violates half of the triplet constraints, so it does no better than a trivial solution.
Under the Unique Games Conjecture, no polynomial-time algorithm can surpass 50% accuracy regardless of embedding dimension.
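To make the 50% figure in these findings concrete, here is a minimal sketch of how triplet accuracy can be measured and why 50% is the trivial baseline. It is an illustration only, not the paper's construction or proof: the names `triplet_accuracy` and `random_points`, the Gaussian point clouds, and the sizes (200 points, 16 ground-truth dimensions, 5000 triplets, a 2-dimensional guess) are all assumptions made for this example. The low-dimensional "guess" is drawn independently of the data, so it stands in for an embedding that has learned nothing.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_accuracy(emb, triplets):
    """Fraction of (anchor, positive, negative) triplets that emb satisfies,
    i.e. where the anchor is strictly closer to the positive than the negative."""
    satisfied = sum(
        sq_dist(emb[a], emb[p]) < sq_dist(emb[a], emb[n]) for a, p, n in triplets
    )
    return satisfied / len(triplets)


def random_points(n, dim, rng):
    """n points with independent standard Gaussian coordinates (illustrative data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
n_points, true_dim, n_triplets = 200, 16, 5000  # hypothetical sizes

# Hypothetical ground-truth embedding used only to label the triplets.
truth = random_points(n_points, true_dim, rng)

triplets = []
for _ in range(n_triplets):
    a, x, y = rng.sample(range(n_points), 3)
    # Order each triplet so that the ground truth satisfies it.
    if sq_dist(truth[a], truth[x]) < sq_dist(truth[a], truth[y]):
        triplets.append((a, x, y))
    else:
        triplets.append((a, y, x))

# A 2-dimensional embedding drawn independently of the data: it carries no
# information about the triplets, so it plays the role of a trivial solution.
guess = random_points(n_points, 2, rng)

acc_truth = triplet_accuracy(truth, triplets)
acc_guess = triplet_accuracy(guess, triplets)
print(f"ground truth: {acc_truth:.3f}, uninformed 2-d guess: {acc_guess:.3f}")
```

The ground-truth embedding satisfies every triplet by construction, while the uninformed guess lands near one half, since each triplet is a coin flip for an embedding that ignores the data. The findings above say something much stronger than this sketch shows: for suitable triplet sets, no low-dimensional embedding, however carefully optimized, does better than this baseline.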
Abstract
Embedding-based representations in Euclidean space are a cornerstone of modern machine learning, where a major goal is to use the smallest dimension that faithfully captures data relations. In this work, we prove sharp dimension-accuracy tradeoffs and identify a fundamental information-theoretic limitation: unless the embedding dimension is chosen close to the ground-truth dimension, accuracy undergoes a sudden collapse. Our main result shows that this phenomenon arises even in standard contrastive learning settings, where supervision is limited to a set of anchor-positive-negative triplets encoding distance comparisons. Specifically, given triplets realizable by an unknown ground-truth embedding in dimensions, we prove that there exists a constant such that every embedding of…
