Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model
Mingqi Li, Fei Ding, Dan Zhang, Long Cheng, Hongxin Hu, Feng Luo

TL;DR
This paper introduces MMKD, a multi-level knowledge distillation approach that enhances multilingual language models by aligning semantic representations at the token, word, sentence, and structure levels, improving cross-lingual understanding, especially for low-resource languages.
Contribution
The paper proposes a novel multi-level distillation framework that leverages rich semantic knowledge from English BERT to improve multilingual models.
Findings
Outperforms baseline models on XNLI and XQuAD benchmarks.
Achieves comparable performance on PAWS-X.
Significant gains on low-resource languages.
Abstract
Pre-trained multilingual language models play an important role in cross-lingual natural language understanding tasks. However, existing methods did not focus on learning the semantic structure of representation, and thus could not optimize their performance. In this paper, we propose Multi-level Multilingual Knowledge Distillation (MMKD), a novel method for improving multilingual language models. Specifically, we employ a teacher-student framework to adopt rich semantic representation knowledge in English BERT. We propose token-, word-, sentence-, and structure-level alignment objectives to encourage multiple levels of consistency between source-target pairs and correlation similarity between teacher and student models. We conduct experiments on cross-lingual evaluation benchmarks including XNLI, PAWS-X, and XQuAD. Experimental results show that MMKD outperforms other baseline models…
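The abstract names the alignment levels but does not give their formulas. As a rough illustration only, the sketch below shows one plausible shape for two of them: a sentence-level term that pulls the student's embedding of a target-language sentence toward the teacher's embedding of its English source, and a structure-level term that compares the pairwise similarity pattern within a batch. The function names, the squared-error form, and the use of cosine similarity are all assumptions made for this example, not the paper's definitions.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sentence_alignment_loss(teacher: List[Vector], student: List[Vector]) -> float:
    """Mean squared error between paired sentence embeddings.

    teacher[i] is the (hypothetical) teacher embedding of English sentence i;
    student[i] is the student embedding of its translation.
    The MSE form is an assumption, not taken from the paper.
    """
    total = 0.0
    count = 0
    for t, s in zip(teacher, student):
        for a, b in zip(t, s):
            total += (a - b) ** 2
            count += 1
    return total / count


def similarity_matrix(batch: List[Vector]) -> List[List[float]]:
    """Pairwise cosine similarities between all embeddings in a batch."""
    return [[cosine(u, v) for v in batch] for u in batch]


def structure_alignment_loss(teacher: List[Vector], student: List[Vector]) -> float:
    """Mean squared difference between teacher and student similarity matrices.

    This stands in for "correlation similarity between teacher and student";
    the exact measure used by MMKD is not stated in the abstract.
    """
    sim_t = similarity_matrix(teacher)
    sim_s = similarity_matrix(student)
    n = len(teacher)
    diff = sum((sim_t[i][j] - sim_s[i][j]) ** 2 for i in range(n) for j in range(n))
    return diff / (n * n)


# Toy batch: the student reproduces the teacher's geometry at twice the scale.
teacher_batch = [[1.0, 0.0], [0.0, 1.0]]
student_batch = [[2.0, 0.0], [0.0, 2.0]]
sentence_loss = sentence_alignment_loss(teacher_batch, student_batch)
structure_loss = structure_alignment_loss(teacher_batch, student_batch)
```

In the toy batch, the structure-level term is zero because cosine similarity ignores scale, while the sentence-level term is positive because the embeddings themselves differ. This is one way to see why the two levels could carry complementary signals.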
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Attention Dropout · Dense Connections · WordPiece · Linear Warmup With Linear Decay
