The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining
Jiandong Shao, Raphael Tang, Crystina Zhang, Karin Sevegnani, Pontus Stenetorp, Jianfei Yang, Yao Lu

TL;DR
This study investigates the impact of bilingual data on multilingual language models, finding that parallel data is crucial for translation, while cross-lingual understanding can be achieved without bilingual content.
Contribution
The paper provides a controlled analysis showing that parallel bilingual data is essential for translation, but not for cross-lingual reasoning tasks, clarifying the role of mixed-language documents.
Findings
Removing bilingual data, which makes up only 2% of the corpus, reduces translation BLEU scores by 56%.
Parallel data restores 91% of translation performance.
Cross-lingual QA and general reasoning performance remains stable when bilingual data is removed.
Abstract
Multilingual large language models achieve impressive cross-lingual performance despite largely monolingual pretraining. While bilingual data in pretraining corpora is widely believed to enable these abilities, details of its contributions remain unclear. We investigate this question by pretraining models from scratch under controlled conditions, comparing the standard web corpus with a monolingual-only version that removes all multilingual documents. Despite constituting only 2% of the corpus, removing bilingual data causes translation performance to drop 56% in BLEU, while behaviour on cross-lingual QA and general reasoning tasks remains stable, with training curves largely overlapping the baseline. To understand this asymmetry, we categorize bilingual data into parallel (14%), code-switching (72%), and miscellaneous documents (14%) based on the semantic relevance of content in…
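To put the quoted proportions in perspective, the short sketch below converts each bilingual category into its share of the full pretraining corpus. It assumes that the 14% / 72% / 14% split is measured within the 2% bilingual portion; the abstract suggests this reading but the text shown here does not state the unit (documents or tokens), and the function and variable names are illustrative only.

```python
# Figures quoted in the abstract. Treating the category percentages as
# fractions of the 2% bilingual portion is an assumption of this sketch.
BILINGUAL_SHARE = 0.02
CATEGORY_SHARES = {
    "parallel": 0.14,
    "code-switching": 0.72,
    "miscellaneous": 0.14,
}


def corpus_shares(bilingual_share, category_shares):
    """Return each category's share of the full corpus (hypothetical helper)."""
    return {
        name: bilingual_share * share
        for name, share in category_shares.items()
    }


shares = corpus_shares(BILINGUAL_SHARE, CATEGORY_SHARES)
for name, share in shares.items():
    print(f"{name}: {share:.2%} of the full corpus")
```

Under that assumption, parallel documents would account for roughly 0.28% of the full corpus, code-switching for about 1.44%, and miscellaneous documents for about 0.28%.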
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
