Selective Weak-to-Strong Generalization

Hao Lang; Fei Huang; Yongbin Li

arXiv:2511.14166 · cs.CL · November 19, 2025

TL;DR

This paper introduces a selective weak-to-strong generalization (W2SG) framework that uses weak supervision only when it is needed. A binary classifier P(IK) identifies questions the strong model can answer on its own, and a graph smoothing method refines the weak labels, improving robustness and performance.

Contribution

It proposes a novel selective W2SG approach that avoids unnecessary weak supervision, enhancing robustness and generalization in superhuman model alignment.

Findings

01. Outperforms baseline methods on three benchmarks
02. Classifier P(IK) generalizes across tasks and difficulties
03. Refined labels improve model robustness

Abstract

Future superhuman models will surpass the ability of humans, and humans will only be able to weakly supervise superhuman models. To alleviate the issue of lacking high-quality data for model alignment, some works on weak-to-strong generalization (W2SG) finetune a strong pretrained model with a weak supervisor so that it can generalize beyond weak supervision. However, the invariable use of weak supervision in existing methods exposes issues in robustness, with a proportion of weak labels proving harmful to models. In this paper, we propose a selective W2SG framework to avoid using weak supervision when unnecessary. We train a binary classifier P(IK) to identify questions that a strong model can answer and use its self-generated labels for alignment. We further refine weak labels with a graph smoothing method. Extensive experiments on three benchmarks show that our method…
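
The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names: choosing between self-generated and weak labels based on P(IK), and smoothing weak labels over a graph. The function names, the 0.5 threshold, the neighbor-list graph, and the label-propagation-style update are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def smooth_labels(
    weak_probs: Sequence[float],
    neighbors: Sequence[Sequence[int]],
    alpha: float = 0.5,
    steps: int = 10,
) -> List[float]:
    """Smooth soft weak labels over a graph.

    ASSUMPTION: this is a generic label-propagation-style update, used as a
    stand-in for the paper's graph smoothing method, which the abstract does
    not specify. neighbors[i] lists the indices of questions similar to i.
    """
    original = list(weak_probs)
    current = list(weak_probs)
    for _ in range(steps):
        updated = []
        for i, nbrs in enumerate(neighbors):
            # Isolated nodes fall back to their own current value.
            if nbrs:
                nbr_mean = sum(current[j] for j in nbrs) / len(nbrs)
            else:
                nbr_mean = current[i]
            # Mix the original weak label with the neighborhood average.
            updated.append((1 - alpha) * original[i] + alpha * nbr_mean)
        current = updated
    return current


def select_labels(
    p_ik: Sequence[float],
    self_labels: Sequence[float],
    refined_weak_labels: Sequence[float],
    threshold: float = 0.5,  # ASSUMPTION: hypothetical cutoff
) -> List[float]:
    """Pick the strong model's own label when P(IK) says it can answer,
    otherwise fall back to the (refined) weak label."""
    return [
        own if p >= threshold else weak
        for p, own, weak in zip(p_ik, self_labels, refined_weak_labels)
    ]
```

In this sketch, a question with a high P(IK) score never sees weak supervision at all, which is one way to read "avoid using weak supervision when unnecessary"; the remaining questions receive weak labels that have been pulled toward the labels of their graph neighbors.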

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Topic Modeling