Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

Aashish Dhawan; Christopher Driggers-Ellis; Christan Grant; Daisy Zhe Wang

arXiv:2601.03135 · cs.CL · May 21, 2026

TL;DR

This paper improves machine translation for low-resource indigenous languages of the Americas by augmenting curated parallel data with synthetic sentence pairs, applying language-specific preprocessing, and fine-tuning a multilingual mBART model. Translation quality, measured with chrF++, improves for Guarani-Spanish and Quechua-Spanish.

Contribution

The paper combines synthetic data augmentation with language-specific preprocessing (orthographic normalization and noise-aware filtering) to improve NMT for indigenous languages.

Findings

01 Synthetic data improves translation quality for Guarani and Quechua.

02 Language-specific preprocessing reduces corpus artifacts.

03 Limitations exist for highly agglutinative languages like Aymara.

Abstract

Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from…
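
The abstract names orthographic normalization and noise-aware filtering but does not spell out the rules used. The following is a minimal Python sketch of what such a preprocessing pass over sentence pairs could look like. It is not the authors' pipeline: the function names, the set of apostrophe variants, the treatment of identical source and target as noise, and the length-ratio threshold of 3.0 are all illustrative assumptions.

```python
import unicodedata

# Assumption: apostrophe-like characters that may appear interchangeably in
# scraped text. This set is illustrative, not taken from the paper.
APOSTROPHE_VARIANTS = "\u2019\u2018\u02BC\u0060\u00B4"


def normalize_text(text: str) -> str:
    """Compose characters (NFC), unify apostrophe variants, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    for ch in APOSTROPHE_VARIANTS:
        text = text.replace(ch, "'")
    return " ".join(text.split())


def keep_pair(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Heuristic noise filter (hypothetical thresholds).

    Rejects pairs with an empty side, pairs whose sides are identical
    (likely untranslated copies), and pairs whose character lengths
    differ by more than max_ratio.
    """
    if not src or not tgt:
        return False
    if src == tgt:
        return False
    longer = max(len(src), len(tgt))
    shorter = min(len(src), len(tgt))
    return longer / shorter <= max_ratio


def clean_corpus(pairs):
    """Normalize every pair, then drop noisy pairs and exact duplicates."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize_text(src), normalize_text(tgt)
        if not keep_pair(src, tgt) or (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned
```

In a setup like the one described, a pass of this kind would run over both the curated and the synthetic pairs before fine-tuning, so that spelling variants of the same word map to one form and obviously misaligned pairs do not reach the model. Real rules would need to be chosen per language, which is the point of language-specific preprocessing.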

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Big Data and Digital Economy