High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws

M. Emrullah Ildiz; Halil Alperen Gozeten; Ege Onur Taga; Marco Mondelli; Samet Oymak

arXiv:2410.18837·stat.ML·February 28, 2025

Open Access

TL;DR

This paper provides a detailed theoretical analysis of knowledge distillation in high-dimensional regression, revealing how surrogate models influence target model risk, and demonstrating conditions under which weak-to-strong generalization can outperform traditional training.

Contribution

It offers the first sharp, non-asymptotic bounds for high-dimensional knowledge distillation, characterizes the optimal surrogate model, and clarifies the limits of weak-to-strong generalization in data scaling.

Findings

01

Weak-to-strong (W2S) training can outperform training on strong labels under the same data budget.

02

Optimal surrogate models depend on data distribution and feature relevance.

03

W2S does not improve data scaling laws despite performance benefits.

Abstract

A growing number of machine learning scenarios rely on knowledge distillation where one uses the output of a surrogate model as labels to supervise the training of a target model. In this work, we provide a sharp characterization of this process for ridgeless, high-dimensional regression, under two settings: (i) model shift, where the surrogate model is arbitrary, and (ii) distribution shift, where the surrogate model is the solution of empirical risk minimization with out-of-distribution data. In both cases, we characterize the precise risk of the target model through non-asymptotic bounds in terms of sample size and data distribution under mild conditions. As a consequence, we identify the form of the optimal surrogate model, which reveals the benefits and limitations of discarding weak features in a data-dependent fashion. In the context of weak-to-strong (W2S) generalization, this…
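The distillation setup described in the abstract can be illustrated with a minimal sketch. It is not the paper's experimental code: the isotropic Gaussian features, the noise level, the 0.05 threshold, and the helper names `min_norm_fit`, `distill`, and `excess_risk` are all assumptions made here for illustration. The surrogate is a hypothetical one that zeroes out small ("weak") coordinates of the ground truth, and the target is the minimum-norm (ridgeless) interpolator of the surrogate's labels.

```python
import numpy as np

def min_norm_fit(X, y):
    """Ridgeless (minimum-norm) least-squares solution via the pseudoinverse."""
    return np.linalg.pinv(X) @ y

def distill(X, beta_surrogate):
    """Label the inputs with the surrogate's predictions, then fit the target on them."""
    surrogate_labels = X @ beta_surrogate
    return min_norm_fit(X, surrogate_labels)

rng = np.random.default_rng(0)
n, p = 50, 200  # fewer samples than features: the overparameterized regime
X = rng.standard_normal((n, p))  # assumption: isotropic Gaussian features
beta_true = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)  # noisy ground-truth labels

# Hypothetical surrogate: ground truth with small-magnitude ("weak") coordinates zeroed out.
beta_surrogate = np.where(np.abs(beta_true) > 0.05, beta_true, 0.0)

beta_from_labels = min_norm_fit(X, y)        # target trained on ground-truth labels
beta_distilled = distill(X, beta_surrogate)  # target trained on surrogate labels

def excess_risk(beta_hat):
    # For isotropic features, excess risk equals the squared parameter error.
    return float(np.sum((beta_hat - beta_true) ** 2))

print("risk with ground-truth labels:", excess_risk(beta_from_labels))
print("risk with surrogate labels:   ", excess_risk(beta_distilled))
```

Which of the two risks is smaller depends on the covariance, the sample size, and the surrogate; the sketch only shows how the two training procedures differ, not the paper's conclusions.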

Taxonomy

Topics: Neural Networks and Applications

Methods: Knowledge Distillation