Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model

Mingqi Li; Fei Ding; Dan Zhang; Long Cheng; Hongxin Hu; Feng Luo

arXiv:2211.01200·cs.CL·November 3, 2022·1 cites

Open Access

TL;DR

This paper introduces MMKD, a multi-level knowledge distillation approach that improves multilingual language models by aligning semantic representations at the token, word, sentence, and structure levels, yielding better cross-lingual understanding, especially for low-resource languages.

Contribution

The paper proposes a novel multi-level distillation framework that leverages rich semantic knowledge from English BERT to improve multilingual models.

Findings

01 Outperforms baseline models on the XNLI and XQuAD benchmarks.

02 Achieves comparable performance on PAWS-X.

03 Shows significant gains on low-resource languages.

Abstract

Pre-trained multilingual language models play an important role in cross-lingual natural language understanding tasks. However, existing methods do not focus on learning the semantic structure of representations, and thus cannot reach their full performance. In this paper, we propose Multi-level Multilingual Knowledge Distillation (MMKD), a novel method for improving multilingual language models. Specifically, we employ a teacher-student framework to transfer the rich semantic representation knowledge in English BERT. We propose token-, word-, sentence-, and structure-level alignment objectives to encourage multiple levels of consistency between source-target pairs and correlation similarity between teacher and student models. We conduct experiments on cross-lingual evaluation benchmarks including XNLI, PAWS-X, and XQuAD. Experimental results show that MMKD outperforms other baseline models…
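The abstract names sentence-level and structure-level alignment between an English teacher and a multilingual student but does not give the loss formulas. The sketch below is only an illustration of what such objectives could look like, under stated assumptions: the sentence-level term is taken to be a mean squared error between paired sentence vectors, and the structure-level term compares pairwise cosine-similarity matrices within a batch. The function names, the choice of MSE and cosine similarity, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sentence_loss(teacher, student):
    """Assumed sentence-level term: MSE between paired sentence vectors."""
    total = sum(
        (t - s) ** 2
        for t_vec, s_vec in zip(teacher, student)
        for t, s in zip(t_vec, s_vec)
    )
    count = sum(len(t_vec) for t_vec in teacher)
    return total / count


def similarity_matrix(vectors):
    """Pairwise cosine similarities among all vectors in a batch."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


def structure_loss(teacher, student):
    """Assumed structure-level term: MSE between the two similarity matrices."""
    t_sim = similarity_matrix(teacher)
    s_sim = similarity_matrix(student)
    n = len(teacher)
    total = sum(
        (t_sim[i][j] - s_sim[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)


# Toy batch: teacher vectors for English sentences, and a hypothetical student
# whose vectors for the translated sentences are a scaled copy of the teacher's.
teacher_en = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_scaled = [[2.0 * x for x in vec] for vec in teacher_en]

point_wise = sentence_loss(teacher_en, student_scaled)   # positive: vectors differ
relational = structure_loss(teacher_en, student_scaled)  # approximately zero
```

In this toy case the student's vectors differ from the teacher's, so the point-wise term is positive, yet the relations among sentences are preserved, so the structure term is approximately zero. This is one way to see why a relational objective can carry different information from a point-wise one.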

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Adam · Attention Dropout · Dense Connections · WordPiece · Linear Warmup With Linear Decay