TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation
Armel Zebaze, Benoît Sagot, Rachel Bawden

TL;DR
TopXGen leverages large language models to generate diverse, high-quality parallel data for low-resource languages, improving translation performance through synthetic data creation and backtranslation.
Contribution
The paper introduces TopXGen, an LLM-based method for generating topic-diverse parallel data in low-resource languages, with the aim of improving low-resource machine translation.
Findings
Improves translation quality for low-resource languages.
Generates high-quality, diverse parallel data.
Enhances fine-tuning and in-context learning performance.
Abstract
LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help. However, the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for the generation of high quality and…
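As a rough illustration of the backtranslation step the abstract describes (target-side texts automatically translated into the source language to form synthetic pairs), here is a minimal Python sketch. It is not TopXGen's implementation: the `backtranslate` function, the `translate_to_source` callable, and the toy translator are hypothetical names introduced only for this example, and a real system would plug in an actual MT model.

```python
from typing import Callable, List, Tuple


def backtranslate(
    target_texts: List[str],
    translate_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-side texts.

    `translate_to_source` is a hypothetical stand-in for any
    target-to-source translation system (assumption, not from the paper).
    """
    pairs = []
    for text in target_texts:
        target = text.strip()
        if not target:
            # Skip empty lines: they carry no training signal.
            continue
        synthetic_source = translate_to_source(target)
        # The target side is the original text; only the source is synthetic.
        pairs.append((synthetic_source, target))
    return pairs


def toy_translator(text: str) -> str:
    # Placeholder "translation" so the sketch runs without a model.
    return text.upper()


pairs = backtranslate(
    ["target sentence one", "   ", "target sentence two"],
    toy_translator,
)
```

In this setup the target side stays as originally written and only the source side is machine-produced, which is why the technique depends on having good target-side texts in the first place.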
