A Recipe of Parallel Corpora Exploitation for Multilingual Large Language Models
Peiqin Lin, André F. T. Martins, Hinrich Schütze

TL;DR
This study investigates how to best leverage parallel corpora to improve multilingual large language models, emphasizing data quality, training objectives, and model size for optimal performance across diverse languages and tasks.
Contribution
It provides comprehensive insights into effective strategies for exploiting parallel corpora, including filtering techniques and the impact of model size and training objectives.
Findings
Filtering noisy translations is crucial for effective use of parallel corpora.
A small corpus (about 10K parallel sentences) can achieve results comparable to those of much larger corpora.
Machine translation objectives alone yield the best performance among tested methods.
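The three findings above can be pictured as a small data-preparation pipeline: drop noisy pairs, keep a modest sample, and format what remains for a translation-only objective. The sketch below is illustrative only and is not the authors' implementation. The scoring function `length_ratio_score` is a toy stand-in for a real translation-quality scorer, and the threshold, the sample size argument, and the prompt template in `to_mt_example` are all assumptions introduced for this example.

```python
import random
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def length_ratio_score(src: str, tgt: str) -> float:
    """Toy quality score in [0, 1]: shorter token count / longer token count.

    Assumption: a real pipeline would use a proper translation-quality
    scorer here; this heuristic only makes the sketch runnable.
    """
    a, b = len(src.split()), len(tgt.split())
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def filter_noisy_pairs(
    pairs: Iterable[Pair],
    score_fn: Callable[[str, str], float] = length_ratio_score,
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the paper
) -> List[Pair]:
    """Keep only pairs whose score reaches the threshold."""
    return [(s, t) for s, t in pairs if score_fn(s, t) >= threshold]


def subsample(pairs: Iterable[Pair], k: int, seed: int = 0) -> List[Pair]:
    """Return at most k pairs, sampled without replacement."""
    pool = list(pairs)
    if len(pool) <= k:
        return pool
    return random.Random(seed).sample(pool, k)


def to_mt_example(src: str, tgt: str, src_lang: str, tgt_lang: str) -> str:
    """Format one pair as a translation training example (hypothetical template)."""
    return (
        f"Translate from {src_lang} to {tgt_lang}.\n"
        f"{src_lang}: {src}\n"
        f"{tgt_lang}: {tgt}"
    )


if __name__ == "__main__":
    corpus = [
        ("the cat sits on the mat", "die Katze sitzt auf der Matte"),
        ("hello", "this is a long unrelated sentence with many words"),
    ]
    clean = subsample(filter_noisy_pairs(corpus), k=10_000)
    for s, t in clean:
        print(to_mt_example(s, t, "English", "German"))
```

Passing the scorer in as `score_fn` keeps the filtering step independent of any particular quality model, so a stronger scorer can be swapped in without touching the rest of the pipeline.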
Abstract
Recent studies have highlighted the potential of exploiting parallel corpora to enhance multilingual large language models, improving performance in both bilingual tasks, e.g., machine translation, and general-purpose tasks, e.g., text classification. Building upon these findings, our comprehensive study aims to identify the most effective strategies for leveraging parallel corpora. We investigate the impact of parallel corpora quality and quantity, training objectives, and model size on the performance of multilingual large language models enhanced with parallel corpora across diverse languages and tasks. Our analysis reveals several key insights: (i) filtering noisy translations is essential for effectively exploiting parallel corpora, while language identification and short sentence filtering have little effect; (ii) even a corpus with just 10K parallel sentences can yield results…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
