From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora
Yingli Shen, Wen Lai, Shuo Wang, Ge Gao, Kangyang Luo, Alexander Fraser, Maosong Sun

TL;DR
This paper introduces TED2025, a large-scale multi-way parallel corpus covering 113 languages, and shows that training multilingual LLMs on it yields better cross-lingual performance than training on unaligned data.
Contribution
The paper presents TED2025, a high-quality multi-way parallel corpus, and explores effective strategies for leveraging it to enhance multilingual large language models.
Findings
Models trained on multi-way parallel data outperform those trained on unaligned data.
Multi-way parallel data improves cross-lingual semantic understanding.
Strategies like continued pretraining and instruction tuning benefit from multi-way data.
Abstract
Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction…
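To make the abstract's definition of multi-way parallel data concrete, here is a minimal sketch of one aligned record and the language pairs it implies. The record layout, the example sentences, and the helper names `pairwise_views` and `pair_count` are illustrative assumptions, not the actual TED2025 format or the authors' code.

```python
from itertools import combinations

# Hypothetical multi-way parallel record: the same segment in several languages.
# (Illustrative only; not taken from TED2025.)
record = {
    "en": "Thank you.",
    "de": "Danke.",
    "fr": "Merci.",
}


def pairwise_views(rec):
    """Yield every unordered language pair implied by one multi-way record."""
    for src, tgt in combinations(sorted(rec), 2):
        yield (src, tgt, rec[src], rec[tgt])


def pair_count(num_languages):
    """Number of unordered language pairs among k aligned languages: k*(k-1)/2."""
    return num_languages * (num_languages - 1) // 2


pairs = list(pairwise_views(record))
for src, tgt, src_text, tgt_text in pairs:
    print(f"{src}-{tgt}: {src_text} | {tgt_text}")
```

Under these assumptions, a segment aligned across k languages yields k(k-1)/2 sentence pairs that all share one meaning, so a segment aligned across 50 languages (the maximum the abstract mentions) would yield `pair_count(50)` = 1225 pairs. Unaligned multilingual text carries no such cross-language links.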
Taxonomy
Topics: Natural Language Processing Techniques · Mathematics, Computing, and Information Processing · Translation Studies and Practices
