Theoretical Analysis of Contrastive Learning under Imbalanced Data: From Training Dynamics to a Pruning Solution
Haixu Liao, Yating Zhou, Songyang Zhang, Meng Wang, Shuai Zhang

TL;DR
This paper provides a theoretical analysis of contrastive learning with Transformer encoders on imbalanced data, revealing neuron dynamics and proposing pruning as a solution to improve representation quality.
Contribution
It introduces a novel theoretical framework for contrastive learning under data imbalance, detailing neuron training stages and proposing pruning to mitigate imbalance effects.
Findings
Neuron weights evolve through three training stages.
Minority features reduce representational capacity.
Pruning restores performance and improves feature separation.
Abstract
Contrastive learning has emerged as a powerful framework for learning generalizable representations, yet its theoretical understanding remains limited, particularly under imbalanced data distributions that are prevalent in real-world applications. Such an imbalance can degrade representation quality and induce biased model behavior, yet a rigorous characterization of these effects is lacking. In this work, we develop a theoretical framework to analyze the training dynamics of contrastive learning with Transformer-based encoders under imbalanced data. Our results reveal that neuron weights evolve through three distinct stages of training, with different dynamics for majority features, minority features, and noise. We further show that minority features reduce representational capacity, increase the need for more complex architectures, and hinder the separation of ground-truth features…
Peer Reviews
Decision: ICLR 2026 Poster
Both the architecture and the training protocol are relevant. Theoretical predictions for the training dynamics of Transformer-based encoders are of utmost interest. Narratives on the assumptions behind the data model and on the formal results are provided. Numerical results on real data support the theoretical claims on the advantage of pruning.
The formal results are hard to read because the main text is not self-contained (see Questions below), despite the commendable effort of Table 1. Numerical illustrations in the vanilla setting, even with synthetic data, could help explain the practical relevance of the bounds provided (for example, by tracking the inner products of lucky/non-lucky neurons with features during training in the 3 regimes, and comparing them with the theoretical bounds). Considering that reviewing proofs in Appendix is
- Provides a rare, neuron-level theoretical analysis of contrastive learning under data imbalance, clearly explaining how minority features are under-learned.
- Connects the analysis to a simple, practical fix (magnitude-based forward-masked, backward-unmasked pruning), making the work actionable.
- Writing and structure are generally clear, making a dense theoretical contribution reasonably accessible.
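The "magnitude-based forward-masked, backward-unmasked pruning" mentioned in the review above can be illustrated with a minimal sketch. This is one plausible reading of that phrase, not the authors' implementation: the forward pass uses only the largest-magnitude weights, while the gradient update is applied to every weight, including the pruned ones. The function names `magnitude_mask` and `train_step`, the pruning ratio, the learning rate, and the squared-error loss (a stand-in for a contrastive loss) are all assumptions made for illustration.

```python
import numpy as np


def magnitude_mask(weights, prune_ratio):
    """Return a 0/1 mask that zeroes the smallest-magnitude entries."""
    k = int(prune_ratio * weights.size)  # number of entries to prune
    mask = np.ones(weights.size, dtype=weights.dtype)
    smallest = np.argsort(np.abs(weights).ravel())[:k]
    mask[smallest] = 0
    return mask.reshape(weights.shape)


def train_step(weights, x, target, prune_ratio=0.5, lr=0.1):
    """One gradient step with a masked forward pass and an unmasked update.

    Assumption: a squared-error loss on a single linear layer stands in
    for the contrastive objective discussed in the paper.
    """
    mask = magnitude_mask(weights, prune_ratio)
    effective = weights * mask            # forward-masked weights
    error = effective @ x - target
    loss = 0.5 * np.sum(error ** 2)
    grad = np.outer(error, x)             # gradient w.r.t. the effective weights
    # Backward-unmasked: the gradient is applied to all entries,
    # so pruned weights keep learning and may re-enter later masks.
    new_weights = weights - lr * grad
    return new_weights, mask, loss


# Hypothetical toy input, chosen only to make the effect visible.
weights = np.array([[0.1, -2.0], [3.0, -0.05]])
x = np.array([1.0, 1.0])
target = np.zeros(2)
new_weights, mask, loss = train_step(weights, x, target)
```

In this toy run the two smallest-magnitude weights are masked out of the forward pass, yet both still change after the update. A masked-backward variant would instead multiply `grad` by `mask` and leave those entries untouched.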
- Experiments mainly compare “with vs. without pruning” and lack baselines from other long-tailed methods.
- Sensitivity to pruning ratio/schedule is not deeply analyzed.
- Paper could more explicitly discuss limitations and when the proposed analysis may not apply.
This paper explores how imbalanced data degrades representation quality in contrastive learning from the novel perspective of neuron weight evolution. Through extensive theoretical analysis, the authors demonstrate that minority features weaken representational power while increasing the demand for complex architectures. Building upon this, they further prove that pruning techniques enhance gradient updates along these dominant features, thereby mitigating performance degradation caused by i
1: This paper uses numerous notations, some of which lack clear definitions upon their first appearance. Additionally, maintaining consistent notation throughout the text would improve readability.
2: Could the authors please clarify the meaning of "feature frequency"? Specifically, how are the majority and minority features identified within the unsupervised learning framework?
3: The study is confined to the Transformer-MLP model. Could the authors discuss the generalizability of their approac
Taxonomy
Topics: Imbalanced Data Classification Techniques · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
