TL;DR
This paper improves machine translation for low-resource indigenous languages of the Americas by augmenting curated parallel data with synthetic sentence pairs, applying language-specific preprocessing, and fine-tuning a multilingual mBART model. It reports chrF++ gains for Guarani-Spanish and Quechua-Spanish translation.
Contribution
It introduces a synthetic data augmentation approach combined with language-specific preprocessing to improve NMT for indigenous languages.
Findings
Synthetic data augmentation yields consistent chrF++ improvements for Guarani-Spanish and Quechua-Spanish translation.
Language-specific preprocessing reduces corpus artifacts.
Limitations exist for highly agglutinative languages like Aymara.
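The preprocessing finding can be illustrated with a minimal sketch of orthographic normalization followed by noise-aware filtering. The paper's actual rules are not given in this summary, so everything below is an assumption made for illustration: the set of apostrophe variants, the length-ratio threshold, and the function names `normalize`, `keep_pair`, and `clean_corpus` are all hypothetical.

```python
import unicodedata

# Hypothetical set of apostrophe-like characters to unify (assumption, not from the paper).
APOSTROPHE_VARIANTS = "\u2019\u2018\u02bc\u00b4`"


def normalize(text: str) -> str:
    """NFC-normalize, map apostrophe variants to ASCII ', and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    for ch in APOSTROPHE_VARIANTS:
        text = text.replace(ch, "'")
    return " ".join(text.split())


def keep_pair(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Illustrative noise filter: reject empty, copied, or length-mismatched pairs."""
    if not src or not tgt:
        return False
    if src.lower() == tgt.lower():  # untranslated copy of the source
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio  # max_ratio is an assumed threshold


def clean_corpus(pairs):
    """Normalize every pair, then keep only unique pairs that pass the filter."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if keep_pair(src, tgt) and (src, tgt) not in seen:
            seen.add((src, tgt))
            cleaned.append((src, tgt))
    return cleaned
```

Normalizing before deduplication matters here: two pairs that differ only in which apostrophe character encodes a glottal stop collapse into one entry once the variants are unified.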
Abstract
Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from…
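To make the chrF++ metric concrete, the following is a simplified sentence-level sketch: an F-score (beta = 2) over character n-grams up to order 6 and word n-grams up to order 2, with precision and recall averaged across all orders. It is an illustration only and is not expected to match a standard implementation exactly, since it omits details such as effective-order handling and corpus-level aggregation. The function names are hypothetical. For reported numbers, a standard tool such as sacreBLEU's `CHRF` metric with `word_order=2` is the usual choice.

```python
from collections import Counter


def _ngrams(items, n):
    """Count n-grams of a string (characters) or a list (words)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def _prec_rec(hyp_items, ref_items, n):
    """Clipped n-gram precision and recall for a single order n."""
    hyp, ref = _ngrams(hyp_items, n), _ngrams(ref_items, n)
    overlap = sum((hyp & ref).values())
    prec = overlap / sum(hyp.values()) if hyp else 0.0
    rec = overlap / sum(ref.values()) if ref else 0.0
    return prec, rec


def chrf_pp(hyp, ref, char_order=6, word_order=2, beta=2.0):
    """Simplified chrF++-style score in [0, 100] for one hypothesis/reference pair."""
    # Character n-grams are computed with whitespace removed.
    hyp_chars, ref_chars = hyp.replace(" ", ""), ref.replace(" ", "")
    scores = [_prec_rec(hyp_chars, ref_chars, n) for n in range(1, char_order + 1)]
    scores += [_prec_rec(hyp.split(), ref.split(), n) for n in range(1, word_order + 1)]
    # Orders with no n-grams count as zero here; standard tools treat them differently.
    prec = sum(p for p, _ in scores) / len(scores)
    rec = sum(r for _, r in scores) / len(scores)
    if prec + rec == 0:
        return 0.0
    return 100 * (1 + beta**2) * prec * rec / (beta**2 * prec + rec)
```

Because most of the score comes from character n-grams, a hypothesis that gets the stem right but one suffix wrong still earns partial credit, which is one reason character-level metrics are favored for agglutinative languages.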
Taxonomy
Topics: Natural Language Processing Techniques · Language and cultural evolution · Big Data and Digital Economy
