Difficult Examples Hurt Unsupervised Contrastive Learning: A Theoretical Perspective

Yi-Ge Zhang; Jingyi Cui; Qiran Li; Yisen Wang

arXiv:2501.01317 · cs.LG · March 5, 2026

Open Access 3 Reviews

TL;DR

This paper provides a theoretical analysis showing that removing difficult examples can improve unsupervised contrastive learning's generalization and performance, challenging previous assumptions about their importance.

Contribution

It develops a theoretical framework explaining how difficult examples negatively impact contrastive learning and proposes methods to improve it by removing or tuning these examples.

Findings

01

Removing difficult examples boosts downstream classification performance.

02

Techniques like margin tuning and temperature scaling enhance generalization bounds.

03

Empirical validation confirms the effectiveness of the proposed methods.

Abstract

Unsupervised contrastive learning has shown significant performance improvements in recent years, often approaching or even rivaling supervised learning in various tasks. However, its learning mechanism is fundamentally different from that of supervised learning. Previous works have shown that difficult examples (well recognized in supervised learning as examples around the decision boundary), which are essential in supervised learning, contribute minimally in unsupervised settings. In this paper, perhaps surprisingly, we find that directly removing difficult examples, although it reduces the sample size, can boost the downstream classification performance of contrastive learning. To uncover the reasons behind this, we develop a theoretical framework modeling the similarity between different pairs of samples. Guided by this framework, we conduct a thorough theoretical analysis revealing that…

Peer Reviews

Decision · ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- **Novel counterintuitive finding**: Removing 8-20% of training data improves performance consistently across 4 datasets
- **Rigorous theoretical framework**: Clean similarity graph model (Theorems 3.1-3.2) proves difficult examples increase error bound from 4δ/(1-λ) to 4δ/(1-λ') where λ' > λ
- **Provided solutions with theory**: Margin tuning (Theorem 4.3), temperature scaling (Theorem 4.5), removal (Corollary 4.1) all improve bounds; experiments confirm (Table 4: Combined method +1.6% CIFAR
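The direction of the bound change quoted in the second point can be checked with one line of algebra. The sketch below uses only the reviewer's symbols and assumes $0 \le \lambda < \lambda' < 1$ and $\delta > 0$; the review itself states only $\lambda' > \lambda$, and the numeric values are illustrative rather than taken from the paper.

$$
\lambda < \lambda' < 1
\;\Longrightarrow\;
0 < 1-\lambda' < 1-\lambda
\;\Longrightarrow\;
\frac{4\delta}{1-\lambda'} > \frac{4\delta}{1-\lambda},
\qquad
\frac{4\delta/(1-\lambda')}{4\delta/(1-\lambda)} = \frac{1-\lambda}{1-\lambda'} .
$$

For example, with $\lambda = 0.5$ and $\lambda' = 0.75$ the bound doubles, from $8\delta$ to $16\delta$.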

Weaknesses

- **Circular difficult example definition**: Section 5.1 selects "currently confusing" examples via cosine similarity during training, not "intrinsically difficult" ones. Figure 4c shows ratio evolves to 90%+ suggesting selection is training-dependent
- **Scalability concerns**: ImageNet-1K gains only 1.36% (Table 8) vs 2-15% on smaller datasets. Only 400 epochs vs standard 800+. No computational cost analysis in Algorithm 1
- **Hyperparameter theory disconnect**: Theorems 4.3/4.5 derive exact
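The training-dependence concern in the first point can be illustrated with a small sketch. The rule below is hypothetical and is not the paper's Section 5.1 procedure: it flags a pair as "currently confusing" when its cosine similarity falls in a band `[low, high)`. The function name `flag_confusing_pairs`, the thresholds, and the two toy embedding snapshots are all assumptions made for illustration.

```python
import numpy as np

def flag_confusing_pairs(z, low, high):
    """Return index pairs (i, j), i < j, whose cosine similarity is in [low, high).

    Hypothetical selection rule for illustration only; the band thresholds
    are not taken from the paper. Rows of z must be non-zero vectors.
    """
    z = np.asarray(z, dtype=float)
    unit = z / np.linalg.norm(z, axis=1, keepdims=True)  # row-normalize
    sim = unit @ unit.T                                   # pairwise cosine similarity
    i, j = np.triu_indices(len(z), k=1)                   # each unordered pair once
    mask = (sim[i, j] >= low) & (sim[i, j] < high)
    return set(zip(i[mask].tolist(), j[mask].tolist()))

# Two toy snapshots of the same three samples at different training stages.
early = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
late = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])

early_flags = flag_confusing_pairs(early, low=0.5, high=0.9)
late_flags = flag_confusing_pairs(late, low=0.5, high=0.9)
print(early_flags, late_flags)
```

Because the rule reads the current embeddings, the same samples can be flagged at one stage and not at another, which is the sense in which such a selection is training-dependent.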

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper theoretically explains the counterintuitive phenomenon that adding "hard" examples can hurt contrastive learning. The spectral framework is a natural fit and the analysis is clean. The block model exposes the role of $\gamma-\beta$ on linear-probe error.
2. The authors propose practical interventions of margin and temperature adjustments motivated by the theory, with improved theoretical error bound. Consistent empirical improvements are observed when applying the interventions. Fu

Weaknesses

1. The theory assumes $0 \leq \beta < \gamma < \alpha < 1$, yet cosine similarity lies in $[-1,1]$. Are similarities shifted or computed from a PSD kernel where values are guaranteed to be nonnegative? If not, can the theorems be relaxed to allow $\beta,\gamma<0$?
2. A brief discussion of the tightness of the presented bounds would be helpful in understanding the significance of the bound. Context on any known lower bounds or contrasting constructions, or even a heuristic level argument would he
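The range mismatch raised in the first point can be made concrete. The sketch below shows one possible shift, the affine map $s \mapsto (s+1)/2$, which sends $[-1,1]$ onto $[0,1]$ and preserves the ordering of similarities. Whether the paper uses this map, a PSD kernel, or something else is not stated in the review; the function names and toy features are assumptions for illustration.

```python
import numpy as np

def cosine_similarity_matrix(z):
    """Pairwise cosine similarity of the rows of z; values lie in [-1, 1]."""
    z = np.asarray(z, dtype=float)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    unit = z / np.clip(norms, 1e-12, None)      # guard against zero-norm rows
    return np.clip(unit @ unit.T, -1.0, 1.0)    # clip away rounding overshoot

def shift_to_unit_interval(sim):
    """Affine map s -> (s + 1) / 2, sending [-1, 1] onto [0, 1]."""
    return (sim + 1.0) / 2.0

# Toy features: the first and third rows point in opposite directions.
features = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
raw = cosine_similarity_matrix(features)
shifted = shift_to_unit_interval(raw)
print(raw.min(), shifted.min())
```

Since the map is strictly increasing, any ordering such as $\beta < \gamma < \alpha$ that holds for raw similarities also holds after the shift; it does not by itself guarantee the strict upper bound $\alpha < 1$, because a raw similarity of exactly 1 maps to 1.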

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The mathematical framework utilized is a simplified (neural collapsed) version of a proven one used in HaoChen et al., 2021. The mathematical derivations in this framework are sound. The theoretical results here can be important for choices of contrastive learning samples.

Weaknesses

The experimental section is only partially sound, with a limited number of experiments and hyperparameter tuning. The presentation needs to be significantly improved. There are typographical errors throughout the paper. Figures 1, 2, 4 contain text that is too small to read. Certain ideas are not explained clearly enough. The theoretical framework of this paper is based heavily on (HaoChen et al., 2021), which argues extensively about the similarities and differences between the empirical and

Taxonomy

Topics: Innovative Teaching and Learning Methods · Evaluation of Teaching Practices · Online and Blended Learning

Methods: Contrastive Learning