The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge
Ryoya Awano, Taiji Suzuki

TL;DR
This paper shows that weak-to-strong (W2S) fine-tuning of two-layer neural networks efficiently learns the features of a target task while preserving pre-trained capabilities, supported by theoretical proofs and synthetic experiments.
Contribution
It provides the first theoretical analysis of W2S in feature learning, showing how it elicits target features without catastrophic forgetting.
Findings
W2S efficiently learns target features in neural networks.
W2S preserves pre-trained off-target features.
Synthetic experiments confirm theoretical predictions.
Abstract
Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the student's representations or operate in restricted settings. Whether multi-step SGD can succeed in feature learning while preserving diverse pre-trained capabilities remains open. We study W2S in the setting of reward-model learning with two-layer neural networks. The strong model has pre-trained representations organized into low-dimensional subspaces, and is fine-tuned under the supervision of a weak model specialized on a single target task. We prove that the strong model efficiently learns the target task, eliciting its pre-trained knowledge while retaining general capabilities. This establishes W2S generalization in the feature-learning regime, in the…
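To make the setup in the abstract concrete, the following is a minimal toy sketch of W2S fine-tuning: a two-layer ReLU student is trained by mini-batch SGD on labels produced by a weak teacher. It is not the paper's construction. The single-direction teacher, the squared loss, the random (rather than pre-trained) student initialization, and every name and hyperparameter (`run_w2s_sketch`, `weak_dir`, `width`, `lr`, and so on) are assumptions made for illustration only.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def run_w2s_sketch(seed=0, d=16, width=32, n=512, batch=64, steps=300, lr=0.05):
    """Toy weak-to-strong loop; returns (initial_loss, final_loss) on weak labels."""
    rng = np.random.default_rng(seed)

    # Hypothetical target-task direction (assumption, not from the paper).
    target = np.zeros(d)
    target[0] = 1.0

    # Weak teacher: a noisy estimate of the target direction.
    weak_dir = target + 0.3 * rng.standard_normal(d) / np.sqrt(d)
    weak_dir /= np.linalg.norm(weak_dir)

    X = rng.standard_normal((n, d))
    y_weak = relu(X @ weak_dir)  # the student only ever sees these weak labels

    # Student: two-layer ReLU network f(x) = a^T relu(W x).
    # Random init stands in for pre-trained weights in this toy version.
    W = rng.standard_normal((width, d)) / np.sqrt(d)
    a = rng.standard_normal(width) / np.sqrt(width)

    def loss(W, a):
        pred = relu(X @ W.T) @ a
        return 0.5 * np.mean((pred - y_weak) ** 2)

    initial = loss(W, a)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)
        Xb, yb = X[idx], y_weak[idx]
        H = Xb @ W.T                      # pre-activations, shape (batch, width)
        A = relu(H)
        err = (A @ a - yb) / batch        # gradient of the mean squared loss w.r.t. predictions
        grad_a = A.T @ err
        grad_W = ((err[:, None] * (H > 0)) * a[None, :]).T @ Xb
        a -= lr * grad_a
        W -= lr * grad_W
    return initial, loss(W, a)


if __name__ == "__main__":
    before, after = run_w2s_sketch()
    print(f"loss on weak labels: {before:.4f} -> {after:.4f}")
```

The sketch only shows the training loop, in which the student fits weak supervision with multi-step SGD. It does not model the pre-trained subspaces or measure retention of off-target features, which are the subject of the paper's analysis.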
