Statistical Consistency and Generalization of Contrastive Representation Learning
Yuanfan Li, Xiyuan Wei, Tianbao Yang, Yiming Ying

TL;DR
This paper develops a comprehensive statistical learning theory for contrastive representation learning that covers consistency, generalization bounds, and retrieval performance, and supports the theory with large-scale experiments.
Contribution
The paper provides the first unified theoretical framework for CRL: it establishes statistical consistency, derives generalization bounds, and analyzes the impact of negative samples.
Findings
Contrastive loss is statistically consistent with optimal ranking.
Generalization bounds of order O(1/m + 1/√n) and O(1/√m + 1/√n) are derived.
Large negative sets empirically improve CRL performance, and the theory accounts for this effect.
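To make the link between a contrastive loss and ranking quality concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the paper's method: the loss is the common InfoNCE form with in-batch negatives, and the score is a simple empirical pairwise AUC. The summary does not give the paper's exact loss or its population retrieval criterion, so the function names `info_nce_loss` and `retrieval_auc`, the `temperature` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def info_nce_loss(queries, keys, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    Row i of `queries` is paired with row i of `keys`; every other row of
    `keys` serves as a negative for query i.
    """
    logits = queries @ keys.T / temperature               # (n, n) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def retrieval_auc(queries, keys):
    """Empirical AUC-type score (assumed definition, for illustration only).

    Fraction of (query, negative) pairs in which the true match scores
    higher than the negative; ties count as one half.
    """
    scores = queries @ keys.T
    pos = np.diag(scores)[:, None]                         # (n, 1) positive scores
    off_diag = ~np.eye(scores.shape[0], dtype=bool)        # mask out positives
    wins = ((pos > scores) & off_diag).sum()
    ties = ((pos == scores) & off_diag).sum()
    return float((wins + 0.5 * ties) / off_diag.sum())


def unit_rows(x):
    """Normalize each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = unit_rows(rng.normal(size=(8, 16)))
    aligned = keys.copy()          # each query coincides with its own key
    mismatched = keys[::-1].copy() # each query coincides with a wrong key
    for name, q in [("aligned", aligned), ("mismatched", mismatched)]:
        print(name, info_nce_loss(q, keys), retrieval_auc(q, keys))
```

In this toy setup the aligned embeddings should give both a lower loss and a higher AUC-type score than the mismatched ones, which is the qualitative relationship that a consistency result formalizes.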
Abstract
Contrastive representation learning (CRL) underpins many modern foundation models. Despite recent theoretical progress, existing analyses suffer from several key limitations: (i) the statistical consistency of CRL remains poorly understood; (ii) available generalization bounds deteriorate as the number of negative samples increases, contradicting the empirical benefits of large negative sets; and (iii) the retrieval performance of CRL has received limited theoretical attention. In this paper, we develop a unified statistical learning theory for CRL. For downstream tasks, we evaluate retrieval quality using an AUC-type population criterion and show that the contrastive loss is statistically consistent with optimal ranking. We further establish a calibration-style inequality that quantitatively relates excess contrastive risk to excess retrieval suboptimality. For upstream…
