Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective
Hamid Osooli, Kareema Batool, Rick Gentry, Tiasa Singha Roy, Ashwin Gupta, Anirudha Ramesh

TL;DR
This paper analyzes how weak-to-strong alignment fails, using a bias-variance perspective, and highlights strong-model variance as an early warning for confident errors.
Contribution
The paper introduces a bias-variance framework for understanding weak-to-strong alignment failures and identifies strong-model variance as a key predictor of deception.
Findings
Strong-model variance predicts confident errors across pipelines.
Covariance adds some information but is less predictive than strong-model variance.
Blind-spot metrics help distinguish failure sources.
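As a rough illustration of the quantities named in these findings, the sketch below computes a per-example strong-model variance across repeated runs, a weak-strong covariance across examples, and a confident-error flag from continuous confidence scores. The function names, the 0.25 threshold, the choice of "confidence in the correct label" as the score, and the toy numbers are all assumptions made for illustration; the paper's own definitions and estimators may differ.

```python
from statistics import mean, pvariance


def strong_variance(strong_runs):
    """Per-example variance of the strong model's confidence across runs.

    strong_runs[r][i] is the (hypothetical) confidence the strong model
    assigns to the correct label of example i on run r.
    """
    n_examples = len(strong_runs[0])
    return [pvariance([run[i] for run in strong_runs]) for i in range(n_examples)]


def mean_confidence(strong_runs):
    """Per-example confidence averaged over runs."""
    n_examples = len(strong_runs[0])
    return [mean(run[i] for run in strong_runs) for i in range(n_examples)]


def weak_strong_covariance(weak_conf, strong_conf):
    """Population covariance between weak and strong confidences over examples."""
    mw, ms = mean(weak_conf), mean(strong_conf)
    return mean((w - mw) * (s - ms) for w, s in zip(weak_conf, strong_conf))


def confident_errors(strong_conf, threshold=0.25):
    """Flag examples where confidence in the correct label is at or below
    the threshold, i.e. the model leans strongly toward a wrong answer.
    The threshold is an illustrative assumption, not a value from the paper."""
    return [c <= threshold for c in strong_conf]


# Toy data (made up): 3 runs of the strong model on 4 examples.
strong_runs = [
    [0.9, 0.1, 0.8, 0.3],
    [0.9, 0.5, 0.8, 0.1],
    [0.9, 0.0, 0.8, 0.2],
]
weak_conf = [0.8, 0.4, 0.7, 0.5]  # weak teacher's confidence per example

variances = strong_variance(strong_runs)
avg_conf = mean_confidence(strong_runs)
flags = confident_errors(avg_conf)
cov = weak_strong_covariance(weak_conf, avg_conf)

for i, (v, f) in enumerate(zip(variances, flags)):
    print(f"example {i}: variance={v:.4f} confident_error={f}")
print(f"weak-strong covariance: {cov:.4f}")
```

In this toy setup, the examples flagged as confident errors are also the ones with nonzero run-to-run variance, which is the kind of relationship the first finding describes. Whether variance is taken across seeds, checkpoints, or samples is a design choice left open here.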
Abstract
Weak-to-strong alignment offers a promising route to scalable supervision, but it can fail when a strong model becomes confidently wrong on examples that lie in the weak teacher's blind spots. Understanding such failures requires going beyond aggregate accuracy, since weak-to-strong errors depend not only on whether the strong model disagrees with its teacher, but also on how confidence and uncertainty are distributed across examples. In this work, we analyze weak-to-strong alignment through a bias-variance-covariance lens that connects misfit theory to practical post-training pipelines. We derive a misfit-based upper bound on weak-to-strong population risk and study its empirical components using continuous confidence scores. We evaluate four weak-to-strong pipelines spanning supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and reinforcement learning…
