mmT5: Modular Multilingual Pre-Training Solves Source Language Hallucinations
Jonas Pfeiffer, Francesco Piccinno, Massimo Nicosia, Xinyi Wang, Machel Reid, Sebastian Ruder

TL;DR
mmT5 is a modular multilingual model that significantly reduces language hallucinations and improves zero-shot language generation accuracy across 40+ languages by disentangling language-specific and language-agnostic information.
Contribution
The paper introduces mmT5, a modular pre-training approach with language-specific modules that enhances multilingual generation and reduces hallucinations compared to existing models.
Findings
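The rate of generating text in the correct target language under zero-shot settings rises from 7% (mT5) to 99% (mmT5).
Outperforms mT5 at the same parameter sizes on representative natural language understanding and generation tasks in 40+ languages.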
Effectively addresses source language hallucination in multilingual models.
Abstract
Multilingual sequence-to-sequence models perform poorly with increased language coverage and fail to consistently generate text in the correct target language in few-shot settings. To address these challenges, we propose mmT5, a modular multilingual sequence-to-sequence model. mmT5 utilizes language-specific modules during pre-training, which disentangle language-specific information from language-agnostic information. We identify representation drift during fine-tuning as a key limitation of modular generative models and develop strategies that enable effective zero-shot transfer. Our model outperforms mT5 at the same parameter sizes by a large margin on representative natural language understanding and generation tasks in 40+ languages. Compared to mT5, mmT5 raises the rate of generating text in the correct language under zero-shot settings from 7% to 99%, thereby greatly alleviating…
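The abstract does not describe the internal form of the language-specific modules, so the following is only a minimal sketch of the general idea: a layer whose shared weights are used for every language, followed by a small per-language module selected by a language ID. The bottleneck shape, the tanh activations, the residual connection, and all names (`LanguageModule`, `ModularLayer`, `forward`) are assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def tanh_vec(vector):
    """Apply tanh element-wise."""
    return [math.tanh(x) for x in vector]


def random_matrix(rows, cols, rng, scale=0.5):
    """Build a rows x cols matrix with small uniform random weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LanguageModule:
    """Hypothetical per-language module: down-project, tanh, up-project, residual."""

    def __init__(self, hidden, bottleneck, rng):
        self.down = random_matrix(bottleneck, hidden, rng)
        self.up = random_matrix(hidden, bottleneck, rng)

    def __call__(self, h):
        z = tanh_vec(matvec(self.down, h))
        delta = matvec(self.up, z)
        # Residual connection: the module only adds a language-specific correction.
        return [a + b for a, b in zip(h, delta)]


class ModularLayer:
    """Shared (language-agnostic) weights plus one module per language."""

    def __init__(self, hidden, bottleneck, languages, seed=0):
        rng = random.Random(seed)
        self.shared = random_matrix(hidden, hidden, rng)
        self.modules = {
            lang: LanguageModule(hidden, bottleneck, rng) for lang in languages
        }

    def forward(self, h, lang):
        # The shared transformation is identical for every language.
        shared_out = tanh_vec(matvec(self.shared, h))
        # Routing: only the module matching the language ID is applied.
        return self.modules[lang](shared_out)


layer = ModularLayer(hidden=4, bottleneck=2, languages=["en", "de", "sw"])
h = [0.5, -1.0, 0.25, 2.0]
out_en = layer.forward(h, "en")
out_de = layer.forward(h, "de")
```

In this sketch the same input yields different outputs for `"en"` and `"de"` because each language has its own module weights, while `self.shared` is the only part that every language passes through. That split is the intuition behind separating language-specific from language-agnostic information.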
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Gated Linear Unit · Attention Is All You Need · Softmax · Layer Normalization · Inverse Square Root Schedule · Byte Pair Encoding · Dropout · Linear Layer · Attention Dropout
