Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

Collin Burns; Pavel Izmailov; Jan Hendrik Kirchner; Bowen Baker; Leo Gao; Leopold Aschenbrenner; Yining Chen; Adrien Ecoffet; Manas Joglekar; Jan Leike; Ilya Sutskever; Jeff Wu

arXiv:2312.09390·cs.CL·December 18, 2023·24 cites

Open Access

TL;DR

This paper investigates whether supervision from a weaker model can elicit the full capabilities of a much stronger model. Strong models naively finetuned on weak labels consistently outperform their weak supervisors, but additional techniques are needed to recover their full capabilities.

Contribution

The paper introduces the concept of weak-to-strong generalization and demonstrates that simple methods can significantly improve how much of a strong model's capability weak supervision elicits.

Findings

01

Strong models naively finetuned on weak-model labels consistently outperform their weak supervisors.

02

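Simple methods can substantially improve weak-to-strong generalization: finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss recovers close to GPT-3.5-level performance on NLP tasks.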

03

The results suggest that empirical progress on the problem of aligning superhuman models with weak supervision is feasible today.

Abstract

Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior - for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong…
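
The setup described in the abstract can be illustrated with a toy analogue. The sketch below is not the paper's method and involves no language models: it assumes a two-feature synthetic task, a logistic-regression "student", and a hypothetical weak supervisor simulated by flipping ground-truth labels at random 30% of the time. That noise model is an idealisation, since the errors of a real weak model are correlated with the inputs. All function names and parameters are illustrative.

```python
import math
import random


def true_label(x):
    # Ground truth for the toy task: which side of the line x0 + x1 = 0.
    return 1 if x[0] + x[1] > 0 else 0


def weak_label(x, rng, error_rate=0.3):
    # Assumed stand-in for a weak supervisor: the correct label,
    # flipped with probability error_rate.
    y = true_label(x)
    return 1 - y if rng.random() < error_rate else y


def train_logreg(xs, ys, epochs=200, lr=0.5):
    # Full-batch gradient descent on the logistic loss; returns (w0, w1, b).
    w0 = w1 = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = gb = 0.0
        for (x0, x1), y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w0 * x0 + w1 * x1 + b)))
            g0 += (p - y) * x0
            g1 += (p - y) * x1
            gb += p - y
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
        b -= lr * gb / n
    return w0, w1, b


def accuracy(preds, ys):
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


def run(seed=0, n_train=2000, n_test=1000):
    rng = random.Random(seed)

    def sample(n):
        return [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]

    train_x, test_x = sample(n_train), sample(n_test)
    test_y = [true_label(x) for x in test_x]

    # The student never sees ground truth during training, only weak labels.
    weak_train_y = [weak_label(x, rng) for x in train_x]
    w0, w1, b = train_logreg(train_x, weak_train_y)

    student_preds = [1 if w0 * x0 + w1 * x1 + b > 0 else 0 for x0, x1 in test_x]
    weak_preds = [weak_label(x, rng) for x in test_x]
    return accuracy(weak_preds, test_y), accuracy(student_preds, test_y)


weak_acc, student_acc = run()
print(f"weak supervisor accuracy: {weak_acc:.3f}")
print(f"student accuracy:         {student_acc:.3f}")
```

Both accuracies are measured against ground truth on held-out points. In this toy the student can beat its supervisor because the label noise is independent of the input and averages out during training; it says nothing about how large the effect is for the models and tasks studied in the paper.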

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Adam · Attention Dropout