A transfer learning framework for weak-to-strong generalization

Seamus Somerstep; Felipe Maia Polo; Moulinath Banerjee; Ya'acov Ritov; Mikhail Yurochkin; Yuekai Sun

arXiv:2405.16236·stat.ML·March 17, 2025

TL;DR

This paper casts weak-to-strong generalization as a transfer learning problem and shows that a stronger language model can be aligned using weaker feedback, such as human feedback, without degrading its capabilities. It also shows that naive fine-tuning suffers from fundamental limitations in this setting.

Contribution

The paper formulates weak-to-strong generalization as a transfer learning problem and proposes a refinement-based approach that outperforms naive fine-tuning.

Findings

01. Refinement approach overcomes fine-tuning limitations

02. Theoretical proof of weak-to-strong generalization feasibility

03. Practical success in multiple LLM alignment tasks

Abstract

Modern large language model (LLM) alignment techniques rely on human feedback, but it is unclear whether these techniques fundamentally limit the capabilities of aligned LLMs. In particular, it is unknown if it is possible to align (stronger) LLMs with superhuman capabilities with (weaker) human feedback without degrading their capabilities. This is an instance of the weak-to-strong generalization problem: using feedback from a weaker (less capable) model to train a stronger (more capable) model. We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs. In particular, we cast the weak-to-strong generalization problem as a transfer learning problem in which we wish to transfer a latent concept prior from a weak model to a strong pre-trained model. We prove that a naive fine-tuning approach suffers from fundamental limitations, but an…

Taxonomy

Topics: Neural Networks and Applications

Methods: ALIGN