InfoNCE Induces Gaussian Distribution
Roy Betser, Eyal Gofer, Meir Yossef Levi, Guy Gilboa

TL;DR
This paper demonstrates that the InfoNCE contrastive loss induces Gaussian distributions in learned representations, providing a theoretical explanation for the Gaussianity observed in contrastive learning models.
Contribution
It establishes that contrastive training with InfoNCE leads to Gaussian representations under certain conditions, supported by theoretical analysis and experiments.
Findings
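Under certain alignment and concentration assumptions, projections of representations trained with InfoNCE asymptotically approach a Gaussian distribution.
Under less strict assumptions, adding a small, asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results.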
Experimental results confirm Gaussian behavior across datasets and architectures.
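The Gaussianity claim above is related to a classical fact (the Maxwell–Poincaré observation): a single coordinate of a point drawn uniformly from the sphere of radius sqrt(d) in R^d looks increasingly Gaussian as d grows. The following is a minimal sketch of that fact only, not the paper's experiment; the function names, sample size, and the choice of excess kurtosis as a Gaussianity check are assumptions made for illustration.

```python
import math
import random


def sample_sphere_coordinate(d, rng):
    """First coordinate of a point drawn uniformly from the sphere of radius sqrt(d) in R^d."""
    # Normalizing an isotropic Gaussian vector gives a uniform direction on the sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return math.sqrt(d) * g[0] / norm


def excess_kurtosis(samples):
    """Sample excess kurtosis; it is 0 for an exact Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    return m4 / (m2 * m2) - 3.0


def projection_excess_kurtosis(d, n=10000, seed=0):
    """Excess kurtosis of the 1-D projection for n uniform points on the sphere in R^d."""
    rng = random.Random(seed)
    return excess_kurtosis([sample_sphere_coordinate(d, rng) for _ in range(n)])


if __name__ == "__main__":
    for d in (3, 16, 64):
        print(d, round(projection_excess_kurtosis(d), 3))
```

For this idealized uniform-on-sphere case the exact excess kurtosis is -6/(d+2), so it is strongly negative in low dimension (the projection is uniform on an interval when d = 3) and shrinks toward the Gaussian value of 0 as d increases. Learned representations need not be exactly uniform on a sphere, so this is only an illustration of the limiting behavior.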
Abstract
Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In this work, we show that the InfoNCE objective induces Gaussian structure in representations that emerge from contrastive training. We establish this result in two complementary regimes. First, we show that under certain alignment and concentration assumptions, projections of the high-dimensional representation asymptotically approach a multivariate Gaussian distribution. Next, under less strict assumptions, we show that adding a small asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results. We support our analysis with experiments on synthetic and…
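For readers unfamiliar with the objective, here is a minimal sketch of one common form of the InfoNCE loss, assuming cosine similarity, a temperature parameter, and in-batch negatives. The function names and the default temperature are hypothetical choices for illustration, and the exact variant analyzed in the paper may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; every positives[j] with
    j != i acts as a negative for anchors[i] (in-batch negatives).
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarities between anchor i and every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(a)


if __name__ == "__main__":
    batch = [[1.0, 0.0], [0.0, 1.0]]
    print(info_nce(batch, batch))        # correctly matched pairs
    print(info_nce(batch, batch[::-1]))  # mismatched pairs
```

The loss is small when each anchor is most similar to its own positive and large when pairs are mismatched. If all embeddings collapse to the same vector, every logit is equal and the loss equals log N for a batch of N pairs.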
Peer Reviews
Decision: ICLR 2026 Oral
- The authors connect new theory on properties of contrastive representations with empirical experiments covering real-world models
- The paper is dense with theoretical results, but still fairly well structured and easy to follow
- The authors release code for reproducibility
1. It seems like the authors missed existing identifiability literature for contrastive learning, e.g. https://arxiv.org/abs/1605.06336, https://proceedings.mlr.press/v54/hyvarinen17a/hyvarinen17a.pdf, https://arxiv.org/abs/1805.08651, https://proceedings.mlr.press/v139/zimmermann21a.html, https://arxiv.org/abs/2007.00810, https://arxiv.org/pdf/2410.21869. It would be good to discuss how the presented theory (which uses a different set of tools) relates to this work; the presented work does not
- The mathematical analysis is rigorous, the first claims (Corollary 1 and Proposition 2) are strong (even if they are directly obtained from two well-known results), and Section 4.2 is very technical but sound.
- The empirical evidence given for real-world contrastive-based models lends credibility to the theoretical analysis. Few works test the Gaussian assumption on the representations of foundation models, even though it is often assumed for downstream applications (e.g. OOD).
- Regarding Assumption 3, the authors mention that it is weaker than the earlier Assumption 1. It is still not clear to me whether such an assumption is realizable and why it would hold in practice. Assumption 1 may be easier to verify empirically (as the authors already mentioned) and it seems more intuitive.
- In the experiments, you consider DINO, a non-contrastive approach that does not introduce any uniformity term in its loss (which is key in your analysis since you always assume the alignment to be
- Theoretical rigor: The paper provides a novel and rigorous analysis using high-dimensional probability and spherical CLT tools (e.g. Hirschfeld–Gebelein–Rényi maximal correlation bound, polar KL decomposition, Maxwell–Poincaré CLT). It derives precise conditions (bounded alignment, uniformity on sphere) under which InfoNCE yields isotropic, Gaussian outputs.
- Comprehensive empirical validation: The authors evaluate across diverse settings: a synthetic Laplace dataset, CIFAR-10 with a small e
- Strong asymptotic assumptions: The theory relies on high-dimensional limits and idealized assumptions (infinite negatives, alignment plateau, perfect norm concentration). These may not hold in all practical cases. The authors acknowledge this, noting the results are asymptotic and “alignment plateau and thin-shell concentration… are not guaranteed to hold universally”. It remains unclear how well the Gaussian approximation holds for moderate $d$ or when these assumptions fail.
- Dependence on
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis
