Scaling Low-Resource MT via Synthetic Data Generation with LLMs
Ona de Gibert, Joseph Attieh, Teemu Vahtola, Mikko Aulamo, Zihao Li, Raúl Vázquez, Tiancheng Hu, Jörg Tiedemann

TL;DR
This paper shows that synthetic data generated by large language models can substantially improve low-resource machine translation across seven diverse target languages, particularly when paired with effective training regimes.
Contribution
The study builds a document-level synthetic parallel corpus with LLMs from English Europarl, extends it via pivoting to 147 additional language pairs, and provides practical insights into its application for low-resource MT.
Findings
Synthetic data improves low-resource MT performance
Effective training regimes enhance benefits of synthetic data
Public repository SynOPUS facilitates future research
Abstract
We investigate the potential of LLM-generated synthetic data for improving low-resource Machine Translation (MT). Focusing on seven diverse target languages, we construct a document-level synthetic corpus from English Europarl, and extend it via pivoting to 147 additional language pairs. Automatic and human evaluation confirm its overall high quality. We study its practical application by (i) identifying effective training regimes, (ii) comparing our data with the HPLT dataset, (iii) studying the effect of varying training data size, and (iv) testing its utility beyond English-centric MT. Finally, we introduce SynOPUS, a public repository for synthetic parallel datasets. Our findings show that LLM-generated synthetic data, even when noisy, can substantially improve MT performance for low-resource languages.
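The abstract does not spell out how the pivoting step works. As a rough illustration only, one common reading is that two English-centric corpora sharing the same English source can be joined on the English side to yield a non-English pair. The sketch below assumes that reading; the function name `pivot_pairs` and the toy sentences are hypothetical and are not taken from the paper or its corpus.

```python
from collections import defaultdict


def pivot_pairs(en_to_x, en_to_y):
    """Join two English-centric corpora on their shared English side.

    Each input is a list of (english, translation) tuples. The result is a
    list of (x, y) tuples for every English sentence present in both inputs.
    This is an assumed, simplified view of pivoting, not the paper's pipeline.
    """
    # Index the second corpus by its English sentence.
    y_by_en = defaultdict(list)
    for en, y in en_to_y:
        y_by_en[en].append(y)

    # Pair each X-side translation with every Y-side translation
    # that shares the same English source sentence.
    pairs = []
    for en, x in en_to_x:
        for y in y_by_en.get(en, []):
            pairs.append((x, y))
    return pairs


# Toy data for illustration (hypothetical, not from the synthetic corpus).
en_eu = [("Thank you.", "Eskerrik asko."), ("Good morning.", "Egun on.")]
en_is = [("Thank you.", "Takk fyrir."), ("See you.", "Sjáumst.")]

pivoted = pivot_pairs(en_eu, en_is)
print(pivoted)
```

Only the sentence that appears on the English side of both toy corpora produces a pivoted pair; English sentences present in just one corpus are dropped.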
Code & Models
- 🤗 Helsinki-NLP/opus-mt-synthetic-en-eu (model) · 24 dl · ♡ 1
- 🤗 Helsinki-NLP/opus-mt-synthetic-en-gd (model) · 11 dl
- 🤗 Helsinki-NLP/opus-mt-synthetic-en-is (model) · 21 dl · ♡ 1
- 🤗 Helsinki-NLP/opus-mt-synthetic-en-ka (model) · 26 dl
- 🤗 Helsinki-NLP/opus-mt-synthetic-en-uk (model) · 61 dl
- 🤗 Helsinki-NLP/opus-mt-synthetic-en-mk (model) · 17 dl
- 🤗 Helsinki-NLP/opus-mt-synthetic-en-so (model) · 159 dl · ♡ 2
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
