Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective

Hamid Osooli; Kareema Batool; Rick Gentry; Tiasa Singha Roy; Ashwin Gupta; Anirudha Ramesh

arXiv:2604.25077 · cs.AI · April 29, 2026


TL;DR

This paper analyzes weak-to-strong alignment failures from a bias-variance perspective and finds that strong-model variance serves as an early warning for confident errors.

Contribution

The paper introduces a bias-variance framework for understanding weak-to-strong alignment failures and identifies strong-model variance as a key predictor of deception.

Findings

01. Strong-model variance predicts confident errors across pipelines.

02. Covariance adds some information but is less predictive.

03. Blind-spot metrics help distinguish failure sources.

Abstract

Weak-to-strong alignment offers a promising route to scalable supervision, but it can fail when a strong model becomes confidently wrong on examples that lie in the weak teacher's blind spots. Understanding such failures requires going beyond aggregate accuracy, since weak-to-strong errors depend not only on whether the strong model disagrees with its teacher, but also on how confidence and uncertainty are distributed across examples. In this work, we analyze weak-to-strong alignment through a bias-variance-covariance lens that connects misfit theory to practical post-training pipelines. We derive a misfit-based upper bound on weak-to-strong population risk and study its empirical components using continuous confidence scores. We evaluate four weak-to-strong pipelines spanning supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and reinforcement learning…
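To make the bias-variance framing concrete, the sketch below applies the textbook squared-error decomposition to continuous confidence scores collected over several runs of a strong model. It is an illustration only, not the paper's misfit-based bound or its exact estimator: the function name `bias_variance_per_example`, the array layout (runs by examples), and the toy numbers are all assumptions introduced here.

```python
import numpy as np


def bias_variance_per_example(scores, labels):
    """Textbook bias-variance split of squared error, per example.

    scores: array of shape (n_runs, n_examples) holding confidence scores
            in [0, 1] from repeated runs (hypothetical layout).
    labels: array of shape (n_examples,) holding 0/1 ground-truth labels.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)

    mean_score = scores.mean(axis=0)             # average prediction over runs
    bias_sq = (mean_score - labels) ** 2         # squared bias per example
    variance = scores.var(axis=0)                # population variance over runs
    mse = ((scores - labels) ** 2).mean(axis=0)  # mean squared error per example
    return bias_sq, variance, mse


# Toy data (made up): 3 runs, 3 examples.
scores = np.array([
    [0.90, 0.2, 0.6],
    [0.80, 0.4, 0.1],
    [0.95, 0.3, 0.9],
])
labels = np.array([1, 0, 0])

bias_sq, variance, mse = bias_variance_per_example(scores, labels)

# For squared error the identity mse = bias^2 + variance holds exactly.
assert np.allclose(bias_sq + variance, mse)

# The example whose scores swing most across runs has the highest variance.
most_unstable = int(variance.argmax())
```

In this toy setup the third example has the most unstable scores across runs, which is the kind of per-example signal the summary above describes as an early warning; how the paper actually estimates variance and covariance is not specified in this listing.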
