BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data
Jean-Loup Tastet, Inar Timiryasov

TL;DR
BabyLlama-2, a 345M-parameter model distilled from two teachers on a 10-million-word corpus, outperforms both its teachers and the baselines on BLiMP and SuperGLUE, highlighting the effectiveness of ensemble distillation when data is limited.
Contribution
This work introduces BabyLlama-2 and shows that, in data-limited settings, a student distilled from an ensemble of two teachers can outperform the teachers themselves.
Findings
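BabyLlama-2 outperforms its teachers, as well as baselines trained on 10 and 100 million words of the same data mix, on BLiMP and SuperGLUE.
An extensive hyperparameter sweep shows that the benefits of distillation cannot be attributed to suboptimal teacher hyperparameters.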
Distillation techniques need further exploration in limited data contexts.
Abstract
We present BabyLlama-2, a 345 million parameter model distillation-pretrained from two teachers on a 10 million word corpus for the BabyLM competition. On BLiMP and SuperGLUE benchmarks, BabyLlama-2 outperforms baselines trained on both 10 and 100 million word datasets with the same data mix, as well as its teacher models. Through an extensive hyperparameter sweep, we demonstrate that the advantages of distillation cannot be attributed to suboptimal hyperparameter selection of the teachers. Our findings underscore the need for further investigation into distillation techniques, particularly in data-limited settings.
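The abstract does not spell out the training objective, so the following is only a minimal sketch of one common way to distill a student from several teachers: the teachers' temperature-softened output distributions are averaged, and the student is trained on a weighted sum of the usual cross-entropy and a KL term towards that average. The function names, the weight `alpha`, and the `temperature` value are illustrative assumptions, not the authors' settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_distillation_loss(student_logits, teacher_logits_list, target,
                               alpha=0.5, temperature=2.0):
    """Loss for one token position (hypothetical helper, not from the paper).

    alpha weights the hard-label cross-entropy; (1 - alpha) weights the
    KL divergence from the averaged teacher distribution to the student.
    """
    # Hard-label cross-entropy at temperature 1.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[target])

    # Soft targets: average the teachers' softened distributions.
    teacher_dists = [softmax(t, temperature) for t in teacher_logits_list]
    n_teachers = len(teacher_dists)
    soft_targets = [
        sum(dist[i] for dist in teacher_dists) / n_teachers
        for i in range(len(student_logits))
    ]

    # KL(soft_targets || student) at the same temperature.
    student_soft = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(soft_targets, student_soft)
        if p > 0.0
    )

    # The T^2 factor keeps the soft-target gradients on a comparable scale.
    return alpha * cross_entropy + (1.0 - alpha) * temperature ** 2 * kl


# Toy example: a 3-token vocabulary and two teachers that disagree slightly.
loss = ensemble_distillation_loss(
    student_logits=[1.0, 0.2, -0.5],
    teacher_logits_list=[[2.0, 0.1, -1.0], [1.5, 0.8, -0.7]],
    target=0,
)
```

In a real training loop the same computation would be applied to every token position in a batch using a tensor library; the scalar version here is only meant to make the structure of the objective explicit.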
Taxonomy
Topics: Neural Networks and Applications · Video Analysis and Summarization · Machine Learning and Data Classification
