TL;DR
This paper provides a formal analysis of weak-to-strong generalization in CNNs, revealing different mechanisms in data-scarce and data-abundant regimes, and highlighting the dynamics of overfitting and label correction.
Contribution
It introduces a theoretical framework for analyzing weak-to-strong generalization from a linear CNN (the weak teacher) to a two-layer ReLU CNN (the strong student) on structured data, identifying distinct regimes and the mechanisms at work in each.
Findings
In the data-scarce regime, generalization depends on the amount of data, and overfitting can be either benign or harmful.
In the data-abundant regime, label correction occurs early in training, but overtraining can then harm performance.
The transition boundary between benign and harmful overfitting is characterized.
Abstract
Weak-to-strong generalization refers to the phenomenon where a stronger model trained under supervision from a weaker one can outperform its teacher. While prior studies aim to explain this effect, most theoretical insights are limited to abstract frameworks or linear/random feature models. In this paper, we provide a formal analysis of weak-to-strong generalization from a linear CNN (weak) to a two-layer ReLU CNN (strong). We consider structured data composed of label-dependent signals of varying difficulty and label-independent noise, and analyze gradient descent dynamics when the strong model is trained on data labeled by the pretrained weak model. Our analysis identifies two regimes -- data-scarce and data-abundant -- based on the signal-to-noise characteristics of the dataset, and reveals distinct mechanisms of weak-to-strong generalization. In the data-scarce regime,…
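The setup described in the abstract can be illustrated with a small NumPy sketch. This is not the paper's construction: the two-patch layout, the names `mu_easy`, `mu_hard`, `p_hard`, `sigma`, the width `m`, the 1/m output scaling, the initialization, and the hand-set weak filter (a hypothetical stand-in for a pretrained weak model that has picked up only the easy signal) are all assumptions made for illustration.

```python
import numpy as np

def make_data(n, d, mu_easy, mu_hard, p_hard, sigma, rng):
    """Each sample has two patches: a label-dependent signal and label-independent noise."""
    y = rng.choice([-1.0, 1.0], size=n)
    hard = rng.random(n) < p_hard                      # which samples carry the hard signal
    signal = np.where(hard[:, None], mu_hard, mu_easy) * y[:, None]
    noise = sigma * rng.standard_normal((n, d))
    return np.stack([signal, noise], axis=1), y        # X has shape (n, 2, d)

def weak_forward(w, X):
    """Linear CNN: one filter applied to every patch, outputs summed over patches."""
    return (X @ w).sum(axis=1)

def weak_labels(w, X):
    """Hard +/-1 labels produced by the weak model."""
    return np.where(weak_forward(w, X) >= 0, 1.0, -1.0)

def strong_forward(W_pos, W_neg, X):
    """Two-layer ReLU CNN with m positive and m negative filters, each of shape (m, d)."""
    pos = np.maximum(X @ W_pos.T, 0.0).sum(axis=(1, 2))
    neg = np.maximum(X @ W_neg.T, 0.0).sum(axis=(1, 2))
    return (pos - neg) / W_pos.shape[0]

def logistic_loss(f, y):
    return np.mean(np.logaddexp(0.0, -y * f))

def gd_step(W_pos, W_neg, X, y, lr):
    """One full-batch gradient descent step on the logistic loss."""
    n, m = X.shape[0], W_pos.shape[0]
    f = strong_forward(W_pos, W_neg, X)
    g = -y / (1.0 + np.exp(y * f))                     # d(loss_i)/d(f_i)
    act_pos = (X @ W_pos.T > 0).astype(float)          # ReLU gates, shape (n, 2, m)
    act_neg = (X @ W_neg.T > 0).astype(float)
    grad_pos = np.einsum("i,ipr,ipd->rd", g, act_pos, X) / (n * m)
    grad_neg = -np.einsum("i,ipr,ipd->rd", g, act_neg, X) / (n * m)
    return W_pos - lr * grad_pos, W_neg - lr * grad_neg

# Illustrative run with assumed hyperparameters.
rng = np.random.default_rng(0)
d, n, m = 50, 40, 8
mu_easy = np.zeros(d); mu_easy[0] = 3.0
mu_hard = np.zeros(d); mu_hard[1] = 3.0
X, y_true = make_data(n, d, mu_easy, mu_hard, p_hard=0.5, sigma=1.0, rng=rng)

y_weak = weak_labels(mu_easy, X)                       # stand-in weak supervisor
W_pos = 0.01 * rng.standard_normal((m, d))
W_neg = 0.01 * rng.standard_normal((m, d))
for _ in range(200):
    W_pos, W_neg = gd_step(W_pos, W_neg, X, y_weak, lr=0.1)

strong_pred = np.where(strong_forward(W_pos, W_neg, X) >= 0, 1.0, -1.0)
print("weak label accuracy vs. true labels:  ", np.mean(y_weak == y_true))
print("strong train agreement with weak labels:", np.mean(strong_pred == y_weak))
```

Evaluating `strong_pred` against true labels on freshly sampled data, while varying `n`, `sigma`, and the number of steps, is one way to probe the regimes the abstract describes. The sketch itself makes no claim about where the paper's transition boundary lies.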