Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty

Inar Timiryasov; Jean-Loup Tastet

arXiv:2308.02019·cs.CL·October 25, 2023·2 cites

TL;DR

This paper shows that distilling an ensemble of language models (a GPT-2 and small LLaMA models) trained on the 10M-word BabyLM dataset into a small, 58M-parameter LLaMA model yields a student that outperforms both its teachers and a similar model trained without distillation.

Contribution

It distills an ensemble of teachers trained on limited data into a single small student model, and shows that the student outperforms both the teachers and a comparable model trained directly on the same data.

Findings

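01 The distilled model exceeds the performance of both of its teachers.

02 Distillation improves sample efficiency compared with direct training.

03 Distillation incurs no performance penalty when the teachers are trained on a sufficiently small dataset.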

Abstract

We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 and small LLaMA models on the developmentally-plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which exceeds in performance both of its teachers as well as a similar model trained without distillation. This suggests that distillation can not only retain the full performance of the teacher model when the latter is trained on a sufficiently small dataset; it can exceed it, and lead to significantly better performance than direct training.
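As an illustration of the idea in the abstract, the following is a minimal sketch of a per-token distillation loss with several teachers. It assumes the common formulation in which the student is trained on a weighted sum of the hard-label cross-entropy and the KL divergence to the averaged, temperature-softened teacher distributions. The abstract does not spell out the loss, so the function names, the averaging of teachers, and the values of alpha and temperature are assumptions made here for illustration, not the authors' settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_distillation_loss(student_logits, teacher_logits_list,
                               target_index, alpha=0.5, temperature=2.0):
    """Hypothetical per-token loss: alpha * CE + (1 - alpha) * T^2 * KL.

    alpha and temperature are illustrative defaults (assumptions).
    """
    # Hard-label cross-entropy of the student at temperature 1.
    student_probs = softmax(student_logits)
    hard_loss = -math.log(student_probs[target_index])

    # Average the softened distributions of all teachers in the ensemble.
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    vocab_size = len(student_logits)
    avg_teacher = [
        sum(p[i] for p in teacher_probs) / len(teacher_probs)
        for i in range(vocab_size)
    ]

    # KL(avg_teacher || student) on the softened student distribution.
    student_soft = softmax(student_logits, temperature)
    soft_loss = sum(
        q * (math.log(q) - math.log(p))
        for q, p in zip(avg_teacher, student_soft)
        if q > 0.0
    )

    # The T^2 factor keeps the soft-target gradient scale comparable
    # across temperatures (standard practice in distillation).
    return alpha * hard_loss + (1.0 - alpha) * temperature ** 2 * soft_loss


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]
    teachers = [[1.5, 0.7, -0.5], [2.5, 0.1, -1.2]]  # made-up logits
    print(ensemble_distillation_loss(student, teachers, target_index=0))
```

When the student's softened distribution matches the ensemble average exactly, the KL term vanishes and only the weighted cross-entropy remains. In a real training loop the same quantity would be computed over batches of token logits with a tensor library rather than Python lists.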

Code & Models

Repositories

timinar/babyllama
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Linear Layer · Residual Connection · Adam · Dropout · Linear Warmup With Cosine Annealing · Byte Pair Encoding · Weight Decay