BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data

Jean-Loup Tastet; Inar Timiryasov

arXiv:2409.17312·cs.CL·September 27, 2024

TL;DR

BabyLlama-2, a 345M-parameter model distilled from two teachers on a 10-million-word corpus, outperforms both its teachers and the baselines on BLiMP and SuperGLUE, highlighting the effectiveness of ensemble distillation when data is limited.

Contribution

This work introduces BabyLlama-2, demonstrating that ensemble distillation yields superior performance over teachers in data-limited settings.

Findings

01. BabyLlama-2 outperforms teachers and baselines on BLiMP and SuperGLUE.

02. Distillation benefits are not due to hyperparameter suboptimality.

03. Distillation techniques need further exploration in limited data contexts.

Abstract

We present BabyLlama-2, a 345 million parameter model distillation-pretrained from two teachers on a 10 million word corpus for the BabyLM competition. On BLiMP and SuperGLUE benchmarks, BabyLlama-2 outperforms baselines trained on both 10 and 100 million word datasets with the same data mix, as well as its teacher models. Through an extensive hyperparameter sweep, we demonstrate that the advantages of distillation cannot be attributed to suboptimal hyperparameter selection of the teachers. Our findings underscore the need for further investigation into distillation techniques, particularly in data-limited settings.
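
The abstract does not spell out the training objective, so the following is only a minimal sketch of one common way to distill a student from an ensemble of teachers: mix the usual hard-label cross-entropy with a KL term against the average of the teachers' temperature-softened distributions. The function names, the mixing weight `alpha`, and the `temperature` value are illustrative assumptions, not settings taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_distillation_loss(student_logits, teacher_logits_list,
                               target_index, alpha=0.5, temperature=2.0):
    """Hypothetical per-token loss: alpha * CE + (1 - alpha) * T^2 * KL.

    alpha and temperature are assumed values chosen for illustration.
    """
    # Hard-label cross-entropy of the student at temperature 1.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[target_index])

    # Soft targets: average the teachers' softened distributions.
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    n_teachers = len(teacher_probs)
    soft_targets = [
        sum(p[i] for p in teacher_probs) / n_teachers
        for i in range(len(student_logits))
    ]

    # KL(soft_targets || student) at the same temperature.
    student_soft = softmax(student_logits, temperature)
    kl = sum(q * math.log(q / p)
             for q, p in zip(soft_targets, student_soft) if q > 0)

    # T^2 keeps the soft-target term on a scale comparable to the CE term.
    return alpha * cross_entropy + (1 - alpha) * temperature ** 2 * kl


# Toy vocabulary of three tokens, two teachers that disagree slightly.
loss = ensemble_distillation_loss(
    student_logits=[0.5, 1.0, 2.0],
    teacher_logits_list=[[0.0, 1.0, 3.0], [1.0, 0.5, 2.5]],
    target_index=2,
)
print(f"distillation loss: {loss:.4f}")
```

When the student already matches every teacher, the KL term vanishes and only the weighted cross-entropy remains; any disagreement with the averaged teacher distribution adds a positive penalty.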

Taxonomy

Topics: Neural Networks and Applications · Video Analysis and Summarization · Machine Learning and Data Classification