TopXGen: Topic-Diverse Parallel Data Generation for Low-Resource Machine Translation

Armel Zebaze; Benoît Sagot; Rachel Bawden

arXiv:2508.08680·cs.CL·August 13, 2025

TL;DR

TopXGen leverages large language models to generate diverse, high-quality parallel data for low-resource languages, improving translation performance through synthetic data creation and backtranslation.

Contribution

The paper introduces TopXGen, an LLM-based method for generating topic-diverse parallel data for low-resource languages, with the aim of improving low-resource machine translation.

Findings

1. Improves translation quality for low-resource languages.
2. Generates high-quality, diverse parallel data.
3. Enhances fine-tuning and in-context learning performance.

Abstract

LLMs have been shown to perform well in machine translation (MT) with the use of in-context learning (ICL), rivaling supervised models when translating into high-resource languages (HRLs). However, they lag behind when translating into low-resource languages (LRLs). Example selection via similarity search and supervised fine-tuning help. However, the improvements they give are limited by the size, quality and diversity of existing parallel datasets. A common technique in low-resource MT is synthetic parallel data creation, the most frequent form of which is backtranslation, whereby existing target-side texts are automatically translated into the source language. However, this assumes the existence of good quality and relevant target-side texts, which are not readily available for many LRLs. In this paper, we present TopXGen, an LLM-based approach for the generation of high quality and…
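To make the backtranslation step mentioned in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names, the toy lexicon, and the French-to-English example are hypothetical stand-ins chosen only so the snippet runs on its own. In practice, `translate_to_source` would be a real target-to-source MT model.

```python
from typing import Callable, List, Tuple


def backtranslate(
    target_texts: List[str],
    translate_to_source: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-side texts."""
    pairs = []
    for tgt in target_texts:
        src = translate_to_source(tgt)  # machine-generated source side
        pairs.append((src, tgt))        # target side is kept unchanged
    return pairs


# Hypothetical word-by-word lookup standing in for a real MT model.
TOY_LEXICON = {"bonjour": "hello", "monde": "world"}


def toy_translate(text: str) -> str:
    """Translate by dictionary lookup; unknown words pass through."""
    return " ".join(TOY_LEXICON.get(w, w) for w in text.lower().split())


pairs = backtranslate(["Bonjour monde"], toy_translate)
```

The resulting pairs keep the original text on the target side and place the machine translation on the source side, which is why the method depends on having good target-side texts to begin with.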
