On the Emergence of Weak-to-Strong Generalization: A Bias-Variance Perspective
Gengze Xu, Wei Yao, Ziqiao Wang, Yong Liu

TL;DR
This paper provides a theoretical analysis of weak-to-strong generalization, showing how a student model trained on a weak teacher's labels can outperform the teacher, especially when approximating the posterior mean and avoiding overfitting.
Contribution
It offers a bias-variance decomposition analysis of W2SG without restrictive assumptions and demonstrates the benefits of reverse cross-entropy loss empirically.
Findings
W2SG is linked to the misfit between student and teacher models.
Approximating the posterior mean enhances W2SG emergence.
Reverse cross-entropy loss improves student performance.
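To make the last finding concrete, here is a minimal sketch of forward versus reverse cross-entropy on a single example. It assumes the usual definition of reverse cross-entropy, in which the student's prediction and the teacher's soft label swap roles; the paper's exact formulation may differ, and the function names (`softmax`, `cross_entropy`, `forward_ce`, `reverse_ce`) and the `eps` clamp are illustrative choices, not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_k target_k * log(pred_k).

    pred is clamped at eps (an assumption of this sketch) to avoid log(0).
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))

def forward_ce(teacher_probs, student_probs):
    """Standard loss: the teacher's soft label is the target."""
    return cross_entropy(teacher_probs, student_probs)

def reverse_ce(teacher_probs, student_probs):
    """Reverse loss (assumed definition): the student's prediction is the
    target and the teacher's soft label sits inside the logarithm."""
    return cross_entropy(student_probs, teacher_probs)

# Usage: a hypothetical weak-teacher label and a student prediction from logits.
teacher = [0.7, 0.3]
student = softmax([2.0, 0.0])
print(forward_ce(teacher, student), reverse_ce(teacher, student))
```

The two losses agree when student and teacher coincide but weight disagreements differently: the forward loss grows without bound as the student becomes confident against any class the teacher gives mass to, whereas the reverse loss only charges the student for the classes it actually predicts.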
Abstract
Weak-to-strong generalization (W2SG) refers to the phenomenon where a strong student model, trained on a dataset labeled by a weak teacher, ultimately outperforms the teacher on the target task. Recent studies attribute this performance gain to the prediction misfit between the student and teacher models. In this work, we theoretically investigate the emergence of W2SG through a generalized bias-variance decomposition of Bregman divergence. Specifically, we show that the expected population risk gap between the student and teacher is quantified by the expected misfit between the two models. While this aligns with previous results, our analysis removes several restrictive assumptions, most notably, the convexity of the student's hypothesis class, required in earlier works. Moreover, we show that W2SG is more likely to emerge when the student model approximates its posterior mean teacher,…
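For readers unfamiliar with the tool named in the abstract, the following is the standard generalized bias-variance decomposition for a Bregman divergence, written as a sketch. The notation ($\phi$, $\bar{y}$, $\hat{y}^{\circ}$) is chosen here for illustration and is not necessarily the paper's; the identity assumes the label $y$ and the prediction $\hat{y}$ are independent and that the dual mean $\hat{y}^{\circ}$ exists.

```latex
% Bregman divergence generated by a strictly convex, differentiable \phi
D_\phi(y, \hat{y}) = \phi(y) - \phi(\hat{y}) - \langle \nabla\phi(\hat{y}),\, y - \hat{y} \rangle

% Primal mean of the label and dual mean of the prediction
\bar{y} = \mathbb{E}[y], \qquad
\nabla\phi(\hat{y}^{\circ}) = \mathbb{E}[\nabla\phi(\hat{y})]

% Decomposition of the expected divergence
\mathbb{E}\, D_\phi(y, \hat{y})
  = \underbrace{\mathbb{E}\, D_\phi(y, \bar{y})}_{\text{noise}}
  + \underbrace{D_\phi(\bar{y}, \hat{y}^{\circ})}_{\text{bias}}
  + \underbrace{\mathbb{E}\, D_\phi(\hat{y}^{\circ}, \hat{y})}_{\text{variance}}
```

With $\phi(u) = \tfrac{1}{2}\|u\|^2$ the divergence is half the squared error, the dual mean reduces to the ordinary mean, and the identity becomes the familiar squared-loss bias-variance decomposition.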
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper studies an interesting problem and offers a theoretically driven explanation of the observed weak-to-strong generalization phenomenon.
- The authors propose a novel reverse cross-entropy loss and empirically demonstrate its effectiveness.
- The experiments use relatively small models by today's standards. While I understand the computational resource constraints, it would be interesting to see more recent and larger models.
- To the best of my understanding, the analysis is done with respect to linear models.
The misfit-based approach to W2S generalization is promising: it can naturally also model pre-training and encompasses many different model classes. The generalization to a non-convex student hypothesis class is a significant improvement, which comes from the original idea of analyzing expected population risk using the bias-variance decomposition of Bregman divergences. The mathematical statements are clear and on first reading seem correct. The main sign
1. One of the main messages of the paper, that W2S is more likely to occur by mimicking the posterior teacher, and the algorithmic encouragement to train on an ensemble of teachers are not in the spirit of W2S generalization and confound other effects with W2S. Consider a regression setup: of course, training on a less noisy distribution is beneficial. Averaging over many teachers effectively denoises the teacher predictor artificially, which makes the student have smaller e
1. The topic of W2SG is timely and relevant; the connection with and distinction from prior works, especially Charikar et al., 2024 and Mulgund & Pabbaraju, 2025 are clearly articulated.
2. Moving from the convex projection arguments in prior works to the expected Bregman bias–variance analysis yields misfit-gain inequalities for non-convex students without convexity or linear head restrictions, addressing a key technical limitation in Charikar et al., 2024 and Mulgund & Pabbaraju, 2025.
3. Emp
1. While the detailed discussion of the relation to prior works and the technical novelties upfront in the introduction (lines 46-82) can provide good motivation for this work for domain experts (as mentioned in Strengths), it could be overwhelming for a general audience in the community. I would suggest reorganizing the introduction carefully, e.g., partitioning the technical novelties in lines 60-82 into bullet points, each summarized by a concise, intuitive title explaining the effect/benefi
Taxonomy
Topics: Model Reduction and Neural Networks · Reservoir Engineering and Simulation Methods
