Weak-to-Strong Generalization under Distribution Shifts
Myeongho Jeon, Jan Sobotka, Suhwan Choi, Maria Brbić

TL;DR
This paper introduces RAVEN, a framework that enhances weak-to-strong model generalization under distribution shifts by dynamically combining weak models, significantly improving out-of-distribution performance across various tasks.
Contribution
RAVEN is a weak-to-strong generalization framework that learns how to combine multiple weak models jointly with the parameters of the strong model, improving robustness under distribution shifts.
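To make the idea of learning weak-model combinations alongside strong-model parameters concrete, here is a minimal sketch in NumPy. It is not the authors' implementation: this summary does not give RAVEN's objective, so the cross-entropy loss, the softmax-parameterized combination weights, the linear strong model, and every name below (`train_weighted_ensemble`, `weak_probs`, `alpha`) are assumptions chosen for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def train_weighted_ensemble(X, weak_probs, n_steps=200, lr=0.5, seed=0):
    """Jointly fit a linear 'strong' classifier and weak-model weights.

    X:          (n, d) features of unlabeled examples.
    weak_probs: (K, n, C) class probabilities from K weak models
                (hypothetical inputs; rows must sum to 1).
    Returns (W, weights): strong-model matrix (d, C) and the learned
    combination weights (K,), which are nonnegative and sum to 1.
    """
    rng = np.random.default_rng(seed)
    K, n, C = weak_probs.shape
    W = 0.01 * rng.standard_normal((X.shape[1], C))  # strong-model params
    alpha = np.zeros(K)                              # logits of the weights

    for _ in range(n_steps):
        w = softmax(alpha)                            # (K,)
        target = np.tensordot(w, weak_probs, axes=1)  # (n, C) mixed labels
        q = softmax(X @ W)                            # (n, C) strong preds

        # Cross-entropy between mixed weak labels and strong predictions.
        # Gradient w.r.t. the strong model (target rows sum to 1).
        grad_W = X.T @ (q - target) / n

        # Loss the strong model incurs against each weak model separately.
        per_model_loss = -(weak_probs * np.log(q + 1e-12)).sum(axis=2).mean(axis=1)

        # Chain rule through the softmax over alpha.
        grad_alpha = w * (per_model_loss - w @ per_model_loss)

        W -= lr * grad_W
        alpha -= lr * grad_alpha

    return W, softmax(alpha)
```

In this sketch a weak model's weight grows when the strong model's predictions agree with that weak model's labels more than with the weighted average of all weak models. Whether RAVEN uses this update rule is not stated here; the block only illustrates what joint optimization over combination weights and strong-model parameters can look like.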
Findings
RAVEN outperforms baselines by over 30% on out-of-distribution tasks.
RAVEN matches or surpasses existing methods on in-distribution tasks.
RAVEN effectively identifies trustworthy weak models by assigning higher weights to more accurate ones.
Abstract
As future superhuman models become increasingly complex, accurately supervising their behavior may exceed human capabilities. Recent works have demonstrated that in such scenarios, weak models can effectively supervise strong models, a phenomenon known as weak-to-strong generalization. However, we find that naive weak-to-strong generalization fails under distribution shifts, often leading to worse performance of the strong model than its weak supervisors. To address this, we propose RAVEN, a robust weak-to-strong generalization framework that dynamically learns the optimal combinations of weak models in addition to parameters of the strong model. We demonstrate the effectiveness of RAVEN on image classification, text classification, and preference alignment tasks. RAVEN outperforms alternative baselines by over 30% on out-of-distribution tasks while matching or surpassing existing…
