MergeDistill: Merging Pre-trained Language Models using Distillation
Simran Khanuja, Melvin Johnson, Partha Talukdar

TL;DR
MergeDistill is a framework that merges pre-trained language models through task-agnostic knowledge distillation, enabling the creation of smaller, efficient models that can outperform larger teachers by leveraging their strengths.
Contribution
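The paper introduces MergeDistill, a method for merging pre-trained language models via task-agnostic knowledge distillation. It targets the inequitable representation of languages in multilingual models, which the authors attribute to limited capacity, skewed pre-training data, and sub-optimal vocabularies.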
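To make the idea of distilling several teachers into one student concrete, here is a minimal sketch of a generic multi-teacher distillation loss. It is not the paper's exact objective, which this summary does not give. The function names, the temperature value, and the assumption that the student and all teachers score the same toy vocabulary are illustrative only.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Average KL divergence from each teacher's softened distribution
    to the student's softened distribution.

    Assumption (hypothetical setup): every teacher and the student
    produce logits over one shared vocabulary of the same size.
    """
    student_probs = softmax(student_logits, temperature)
    losses = [
        kl_divergence(softmax(t, temperature), student_probs)
        for t in teacher_logits_list
    ]
    return sum(losses) / len(losses)

if __name__ == "__main__":
    teachers = [[1.0, 2.0, 3.0], [0.5, 2.5, 3.0]]
    student = [1.0, 2.0, 2.5]
    print(distillation_loss(student, teachers))
```

The loss is zero when the student reproduces a teacher's distribution exactly and grows as the two diverge. In a real merging setup the teachers would typically have different vocabularies, so their predictions would first need to be mapped onto the student's vocabulary; that step is omitted here.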
Findings
Merged models perform competitively with, or better than, larger teacher models.
Teacher selection significantly impacts student model performance.
The framework makes it possible to leverage multiple pre-trained models with minimal dependencies.
Abstract
Pre-trained multilingual language models (LMs) have achieved state-of-the-art results in cross-lingual transfer, but they often lead to an inequitable representation of languages due to limited capacity, skewed pre-training data, and sub-optimal vocabularies. This has prompted the creation of an ever-growing pre-trained model universe, where each model is trained on large amounts of language or domain specific data with a carefully curated, linguistically informed vocabulary. However, doing so brings us back full circle and prevents one from leveraging the benefits of multilinguality. To address the gaps at both ends of the spectrum, we propose MergeDistill, a framework to merge pre-trained LMs in a way that can best leverage their assets with minimal dependencies, using task-agnostic knowledge distillation. We demonstrate the applicability of our framework in a practical setting by…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
