Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages
Fabian David Schmidt, Philipp Borchert, Ivan Vulić, Goran Glavaš

TL;DR
This paper introduces MT-LLMs, a novel approach combining machine translation encoders with large language models through self-distillation, significantly enhancing cross-lingual natural language understanding in over 200 languages.
Contribution
The work presents a new method of integrating MT encoders into LLMs via self-distillation, enabling effective multilingual NLU especially for low-resource languages.
Findings
MT-LLMs outperform translate-test methods across multiple NLU tasks
The approach improves NLU performance in over 127 low-resource languages
MT-LLMs maintain multilingual alignment and reduce translation errors
Abstract
LLMs have become a go-to solution not just for text generation, but also for natural language understanding (NLU) tasks. Acquiring extensive knowledge through language modeling on web-scale corpora, they excel on English NLU, yet struggle to extend their NLU capabilities to underrepresented languages. In contrast, machine translation (MT) models produce excellent multilingual representations, resulting in strong translation performance even for low-resource languages. MT encoders, however, lack the knowledge necessary for comprehensive NLU that LLMs obtain through language modeling training on immense corpora. In this work, we get the best of both worlds by integrating MT encoders directly into LLM backbones via sample-efficient self-distillation. The resulting MT-LLMs preserve the inherent multilingual representational alignment from the MT encoder, allowing lower-resource languages to…
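The abstract does not spell out the training objective, so the following is only an illustrative sketch of one common form of representation-matching distillation, not the authors' method. It assumes a trainable linear projection that maps a pooled MT-encoder state into the LLM's representation space and is fitted with a mean-squared-error loss against the frozen LLM's own representation of the same sentence. All names, dimensions, and vectors here (`distill_step`, `mt_state`, `llm_target`) are hypothetical stand-ins.

```python
# Toy sketch (assumption, not the paper's implementation):
# fit a linear projection W so that W @ mt_state approximates llm_target.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)

def distill_step(W, mt_state, llm_target, lr=0.1):
    """One gradient-descent step on MSE(W @ mt_state, llm_target) w.r.t. W.

    Returns the updated projection and the loss measured before the update.
    """
    pred = matvec(W, mt_state)
    n = len(pred)
    # dL/dW[i][j] = (2 / n) * (pred[i] - target[i]) * mt_state[j]
    new_W = [
        [w - lr * (2.0 / n) * (p - t) * xj for w, xj in zip(row, mt_state)]
        for row, p, t in zip(W, pred, llm_target)
    ]
    return new_W, mse(pred, llm_target)

# Hypothetical stand-ins for a pooled MT-encoder output (student input)
# and the frozen LLM's representation of the same sentence (teacher target).
mt_state = [0.5, -0.2, 0.1, 0.8]
llm_target = [0.3, -0.6, 0.9]

# Projection from MT space (4 dims) to LLM space (3 dims), initialised to zero.
W = [[0.0] * len(mt_state) for _ in range(len(llm_target))]

losses = []
for _ in range(50):
    W, loss = distill_step(W, mt_state, llm_target)
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
```

In a real setting the teacher and student states would come from full models over batches of text, and the projection would typically be trained alongside adapter weights. The toy version only shows the shape of the objective: the teacher signal comes from the LLM itself, so no task labels are needed.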
Taxonomy
Topics: Natural Language Processing Techniques
