Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty
Inar Timiryasov, Jean-Loup Tastet

TL;DR
This paper demonstrates that distilling an ensemble of small language models trained on limited data can produce a single small model that outperforms its teachers, with no performance penalty from distillation, thereby improving sample efficiency.
Contribution
It introduces a method of distilling an ensemble of small models trained on limited data into a single small model that outperforms both its teachers and a similar model trained without distillation.
Findings
Distilled model exceeds teacher performance.
Distillation improves sample efficiency.
Distillation incurs no performance penalty when the teachers are trained on a small dataset.
Abstract
We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which exceeds in performance both of its teachers as well as a similar model trained without distillation. This suggests that distillation can not only retain the full performance of the teacher model when the latter is trained on a sufficiently small dataset; it can exceed it, and lead to significantly better performance than direct training.
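The ensemble-to-student distillation described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not state the loss, so this assumes the standard soft-target formulation (hard-label cross-entropy plus a temperature-scaled KL term), with the teachers' softened distributions averaged. The function names and the `temperature` and `alpha` values are hypothetical choices made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_distillation_loss(student_logits, teacher_logits_list,
                               target_index, temperature=2.0, alpha=0.5):
    """Per-token loss: alpha * CE(hard label) + (1 - alpha) * T^2 * KL(teachers || student).

    Assumption: the teachers' temperature-softened distributions are averaged
    into one soft target. temperature and alpha are illustrative defaults.
    """
    # Hard-label cross-entropy at temperature 1.
    student_probs = softmax(student_logits)
    cross_entropy = -math.log(student_probs[target_index])

    # Soft target: mean of the teachers' softened distributions.
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    n_teachers = len(teacher_probs)
    soft_target = [
        sum(p[i] for p in teacher_probs) / n_teachers
        for i in range(len(student_logits))
    ]

    # KL divergence from the softened student to the soft target.
    student_soft = softmax(student_logits, temperature)
    kl = sum(
        q * math.log(q / p)
        for q, p in zip(soft_target, student_soft)
        if q > 0.0
    )

    # T^2 keeps the soft-target gradient scale comparable across temperatures.
    return alpha * cross_entropy + (1.0 - alpha) * temperature ** 2 * kl


# Example: one token position, a 3-word vocabulary, two teachers.
student = [2.0, 0.5, -1.0]
teachers = [[1.5, 1.0, -0.5], [2.5, 0.0, -1.0]]
loss = ensemble_distillation_loss(student, teachers, target_index=0)
```

When the student's logits match those of every teacher, the KL term vanishes and only the weighted cross-entropy remains; any disagreement with the averaged teacher distribution adds a positive penalty.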
Code & Models
- 🤗 timinar/baby-llama-58m · model · 669 downloads · 11 likes
- 🤗 andrijdavid/baby-llama-58m-GGUF · model · 1 download
- 🤗 RichardErkhov/timinar_-_baby-llama-58m-4bits · model
- 🤗 RichardErkhov/timinar_-_baby-llama-58m-8bits · model
- 🤗 HenryHHHH/DistilLlama · model · 9 downloads · 3 likes
- 🤗 HenryHHHH/DistilLlamaV1 · model · 3 downloads
- 🤗 RichardErkhov/HenryHHHH_-_DistilLlamaV1-gguf · model · 21 downloads
- 🤗 RichardErkhov/HenryHHHH_-_DistilLlama-gguf · model · 49 downloads
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Residual Connection · Adam · Dropout · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Weight Decay
