Quantifying the Gain in Weak-to-Strong Generalization
Moses Charikar, Chirag Pabbaraju, Kirankumar Shiragur

TL;DR
This paper develops a theoretical framework to understand how strong language models improve when trained on labels from weaker models, explaining the weak-to-strong generalization phenomenon and guiding model training choices.
Contribution
It introduces a formal theory that quantifies the strong model's performance gain in terms of its misfit error on the weak model's labels, and it validates the theory's predictions experimentally.
Findings
The strong model's improvement over its weak supervisor is quantified by the misfit error it incurs on the weak labels.
The theory predicts the extent of improvement and helps select weak models.
Empirical validation confirms the theoretical predictions.
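
The relationship between gain and misfit can be illustrated with a small regression simulation. The sketch below is not the authors' experiment; it assumes a simplified setting in which the strong model is a linear head over fixed features (a convex function class), the target is realizable by that class, and all errors are squared errors measured on the training sample rather than on the population. The function name `weak_to_strong_demo`, the tanh-based weak features, and all dimensions are hypothetical choices made for illustration. Under these assumptions, the strong model's error on the true labels is at most the weak model's error minus the misfit.

```python
import numpy as np


def weak_to_strong_demo(seed=0, n=2000, d_strong=16, d_weak=4):
    """Return (weak_err, misfit, strong_err) for a toy weak-to-strong setup."""
    rng = np.random.default_rng(seed)

    # Strong representation; the target is linear in it (realizability assumption).
    H_s = rng.normal(size=(n, d_strong))
    w_star = rng.normal(size=d_strong)
    y_true = H_s @ w_star

    # Weak model: a linear fit on a few distorted features (hypothetical choice),
    # so its predictions lie outside the span of the strong features.
    H_w = np.tanh(H_s[:, :d_weak])
    w_weak, *_ = np.linalg.lstsq(H_w, y_true, rcond=None)
    y_weak = H_w @ w_weak

    # Weak-to-strong step: fit the strong linear head to the WEAK labels only.
    w_sw, *_ = np.linalg.lstsq(H_s, y_weak, rcond=None)
    y_sw = H_s @ w_sw

    weak_err = np.mean((y_weak - y_true) ** 2)   # weak model vs. ground truth
    misfit = np.mean((y_sw - y_weak) ** 2)       # strong model vs. weak labels
    strong_err = np.mean((y_sw - y_true) ** 2)   # strong model vs. ground truth
    return weak_err, misfit, strong_err


weak_err, misfit, strong_err = weak_to_strong_demo()
print(f"weak error:        {weak_err:.4f}")
print(f"misfit:            {misfit:.4f}")
print(f"strong error:      {strong_err:.4f}")
print(f"weak error-misfit: {weak_err - misfit:.4f}")
```

In this linear setting the least-squares fit is an orthogonal projection, so the inequality holds with equality on the sample; for a general convex class one would expect only the inequality.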
Abstract
Recent advances in large language models have shown capabilities that are extraordinary and near-superhuman. These models operate with such complexity that reliably evaluating and aligning them proves challenging for humans. This leads to the natural question: can guidance from weak models (like humans) adequately direct the capabilities of strong models? In a recent and somewhat surprising work, Burns et al. (2023) empirically demonstrated that when strong models (like GPT-4) are finetuned using labels generated by weak supervisors (like GPT-2), the strong models outperform their weaker counterparts -- a phenomenon they term weak-to-strong generalization. In this work, we present a theoretical framework for understanding weak-to-strong generalization. Specifically, we show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by…
Peer Reviews
Decision: NeurIPS 2024 poster
- Proves a clean and intuitive theory for weak-to-strong regression.
- Empirical results showing that the proposed bounds are tight (in fact, almost exact).
- The results of WSCM20 are not properly contextualized. Their analysis is *not* limited to a self-training scenario and applies for any student model learning from an arbitrary teacher, including a student that is more powerful than the teacher.
- The paper is missing a discussion of and citations to relevant work in other semi- or un-supervised settings that bound generalization error in terms of the disagreement between two classifiers, such as [1], [2], and especially [3].
[1] https://arxi
Taxonomy
Topics: Neural Networks and Applications
