Quantifying the Gain in Weak-to-Strong Generalization

Moses Charikar; Chirag Pabbaraju; Kirankumar Shiragur

arXiv:2405.15116·cs.LG·October 24, 2024

TL;DR

This paper develops a theoretical framework to understand how strong language models improve when trained on labels from weaker models, explaining the weak-to-strong generalization phenomenon and guiding model training choices.

Contribution

The paper introduces a formal theory that links the strong model's performance gain to the misfit error it incurs on labels generated by the weak model, and validates the theory's predictions experimentally.

Findings

01. The strong model's improvement over the weak model correlates with its misfit error on the weak labels.

02. The theory predicts the extent of improvement and helps select among weak models.

03. Empirical validation confirms the theoretical predictions.
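
A minimal numerical sketch of why misfit can track the gain, under simplifying assumptions that are ours and not taken from the text above: squared loss, a strong model that only fits a linear head on fixed features `H`, and a ground truth that this linear class represents exactly. The weak supervisor, its noisy features `G`, and the function name `weak_to_strong_gap` are hypothetical. In this toy setting, least squares projects the weak labels onto the span of `H`, so the weak model's error splits into the strong model's error plus the misfit.

```python
import numpy as np


def weak_to_strong_gap(n=2000, d=8, k=2, seed=0):
    """Toy weak-to-strong regression; returns (weak_err, strong_err, misfit)."""
    rng = np.random.default_rng(seed)

    # Hypothetical strong-model features and a ground truth that is linear in them.
    H = rng.normal(size=(n, d))
    y_true = H @ rng.normal(size=d)

    # Hypothetical weak model: a linear fit to the truth on k noisy features.
    G = H[:, :k] + 0.5 * rng.normal(size=(n, k))
    w_weak, *_ = np.linalg.lstsq(G, y_true, rcond=None)
    y_weak = G @ w_weak

    # Strong model: linear head on H, trained only on the weak labels.
    w_sw, *_ = np.linalg.lstsq(H, y_weak, rcond=None)
    y_sw = H @ w_sw

    weak_err = np.mean((y_weak - y_true) ** 2)    # weak model vs. ground truth
    strong_err = np.mean((y_sw - y_true) ** 2)    # strong model vs. ground truth
    misfit = np.mean((y_sw - y_weak) ** 2)        # strong model vs. weak labels
    return weak_err, strong_err, misfit


if __name__ == "__main__":
    weak_err, strong_err, misfit = weak_to_strong_gap()
    print(f"weak error            : {weak_err:.4f}")
    print(f"strong error          : {strong_err:.4f}")
    print(f"misfit                : {misfit:.4f}")
    print(f"weak error - misfit   : {weak_err - misfit:.4f}")
```

Calling `weak_to_strong_gap()` returns three mean squared quantities. In this setup the first equals the sum of the other two up to floating-point error, so a larger misfit on the weak labels means a larger gain over the weak model.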

Abstract

Recent advances in large language models have shown capabilities that are extraordinary and near-superhuman. These models operate with such complexity that reliably evaluating and aligning them proves challenging for humans. This leads to the natural question: can guidance from weak models (like humans) adequately direct the capabilities of strong models? In a recent and somewhat surprising work, Burns et al. (2023) empirically demonstrated that when strong models (like GPT-4) are finetuned using labels generated by weak supervisors (like GPT-2), the strong models outperform their weaker counterparts -- a phenomenon they term weak-to-strong generalization. In this work, we present a theoretical framework for understanding weak-to-strong generalization. Specifically, we show that the improvement in performance achieved by strong models over their weaker counterparts is quantified by…

Peer Reviews

Decision·NeurIPS 2024 poster

Reviewer 01·Rating 7·Confidence 5

Strengths

- Proves a clean and intuitive theory for weak-to-strong regression.
- Empirical results showing that the proposed bounds are tight (in fact, almost exact).

Weaknesses

- The results of WSCM20 are not properly contextualized. Their analysis is *not* limited to a self-training scenario and applies for any student model learning from an arbitrary teacher, including a student that is more powerful than the teacher.
- The paper is missing a discussion of and citations to relevant work in other semi- or un-supervised settings that bound generalization error in terms of the disagreement between two classifiers, such as [1], [2], and especially [3].

[1] https://arxi

Code & Models

Repositories

chogba/wtsg-regression·Official

Videos

Quantifying the Gain in Weak-to-Strong Generalization·slideslive

Taxonomy

Topics·Neural Networks and Applications