When Babies Teach Babies: Can student knowledge sharing outperform Teacher-Guided Distillation on small datasets?
Srikrishna Iyer

TL;DR
This paper explores a teacher-less, student knowledge sharing approach for data-efficient language model pretraining, demonstrating it can match or outperform traditional teacher-guided methods on small datasets.
Contribution
Introduces a dynamic weighted mutual learning framework that eliminates the need for a teacher model, improving data efficiency in language model pretraining.
Findings
Teacher-less methods match or surpass teacher-supervised approaches.
Dynamic weighting improves knowledge distillation effectiveness.
Bi-level optimization enhances student diversity and performance.
Abstract
We present our submission to the BabyLM challenge, aiming to push the boundaries of data-efficient language model pretraining. Our method builds upon deep mutual learning, introducing a student model search for diverse initialization. We address the limitation of treating students equally by formulating weighted mutual learning as a bi-level optimization problem. The inner loop learns compact students through online distillation, while the outer loop optimizes weights for better knowledge distillation from diverse students. This dynamic weighting strategy eliminates the need for a teacher model, reducing computational requirements. Our evaluations show that teacher-less methods can match or surpass teacher-supervised approaches.
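The abstract describes the method only at a high level, so the following is a minimal sketch of what a weighted mutual learning step might look like, not the author's implementation. It assumes each student minimizes its own cross-entropy plus a weighted KL term toward its peers (the inner loop), and it replaces the outer bi-level weight optimization with a simple exponentiated update driven by validation loss. The function names, the `alpha` mixing coefficient, and the `lr` step size are all hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of the gold labels.
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def kl_divergence(p, q):
    # Mean KL(p || q) over the batch.
    return float(np.mean(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)))

def weighted_mutual_loss(logits, labels, weights, i, alpha=0.5):
    # Inner-loop loss for student i (assumed form):
    # (1 - alpha) * CE(student i) + alpha * sum_j w_j * KL(peer j || student i)
    probs = [softmax(l) for l in logits]
    ce = cross_entropy(probs[i], labels)
    peers = [j for j in range(len(logits)) if j != i]
    peer_w = np.array([weights[j] for j in peers], dtype=float)
    peer_w = peer_w / peer_w.sum()  # renormalize over peers only
    kls = np.array([kl_divergence(probs[j], probs[i]) for j in peers])
    return (1 - alpha) * ce + alpha * float(peer_w @ kls)

def update_weights(weights, val_losses, lr=1.0):
    # Simplified stand-in for the outer loop (an assumption, not the paper's
    # bi-level solver): students with lower validation loss gain weight.
    w = np.asarray(weights, dtype=float) * np.exp(-lr * np.asarray(val_losses, dtype=float))
    return w / w.sum()
```

In this sketch, equal weights recover plain deep mutual learning, where every peer contributes equally; the weight update is what lets stronger students influence their peers more.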
Taxonomy
Topics: Innovative Teaching and Learning Methods · Online and Blended Learning · Educational Assessment and Improvement
Methods: Knowledge Distillation
