AfroXLMR-Comet: Multilingual Knowledge Distillation with Attention Matching for Low-Resource languages
Joshua Sakthivel Raju, Sanjay S, Jaskaran Singh Walia, Srinivas, Raghav, Vukosi Marivate

TL;DR
This paper introduces AfroXLMR-Comet, a compact multilingual model for low-resource languages that uses hybrid knowledge distillation with attention matching to maintain high performance while significantly reducing size.
Contribution
The paper presents a novel hybrid distillation method combining traditional knowledge distillation with attention matching, tailored for multilingual low-resource language models.
Findings
Achieves over 85% size reduction of the teacher model.
Maintains 85% of the original model's accuracy on African languages.
Demonstrates competitive performance with substantially fewer resources.
Abstract
Language model compression through knowledge distillation has emerged as a promising approach for deploying large language models in resource-constrained environments. However, existing methods often struggle to maintain performance when distilling multilingual models, especially for low-resource languages. In this paper, we present a novel hybrid distillation approach that combines traditional knowledge distillation with a simplified attention matching mechanism, specifically designed for multilingual contexts. Our method introduces an extremely compact student model architecture, significantly smaller than conventional multilingual models. We evaluate our approach on five African languages: Kinyarwanda, Swahili, Hausa, Igbo, and Yoruba. The distilled student model, AfroXLMR-Comet, successfully captures both the output distribution and internal attention patterns of a larger teacher…
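To make the idea of a hybrid objective concrete, the following is a minimal sketch of how a distillation term and an attention matching term might be combined. It is not the authors' implementation: the abstract does not specify the loss details, so the temperature-scaled KL divergence, the mean squared error on attention maps, the weighting parameter `alpha`, and all function names here are assumptions chosen for illustration. The sketch also assumes the student and teacher attention maps already have the same shape, which in practice may require averaging or mapping heads and layers.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the common convention for keeping gradient
    magnitudes comparable across temperatures (an assumption here).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def attention_loss(student_attn, teacher_attn):
    """Mean squared error between two attention maps of equal shape."""
    diffs = [
        (s - t) ** 2
        for s_row, t_row in zip(student_attn, teacher_attn)
        for s, t in zip(s_row, t_row)
    ]
    return sum(diffs) / len(diffs)

def hybrid_loss(student_logits, teacher_logits,
                student_attn, teacher_attn,
                alpha=0.5, temperature=2.0):
    """Weighted sum of the two terms; alpha is a hypothetical weight."""
    kd = kd_loss(student_logits, teacher_logits, temperature)
    attn = attention_loss(student_attn, teacher_attn)
    return alpha * kd + (1 - alpha) * attn
```

Under these assumptions, the loss is zero when the student reproduces both the teacher's logits and its attention map, and it grows as either the output distribution or the attention pattern drifts from the teacher's.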
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Softmax · Attention Is All You Need · Knowledge Distillation
