Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG
David Samuel Setiawan, Raphaël Merx, Jey Han Lau

TL;DR
This paper addresses domain shift in low-resource neural machine translation with a hybrid framework in which an LLM refines NMT drafts using retrieval-augmented generation (RAG), recovering 8.10 chrF++ points on an unseen domain.
Contribution
The paper proposes a hybrid NMT + RAG framework that improves low-resource translation under domain shift, and shows that the number of retrieved examples matters more than the retrieval method.
Findings
The RAG approach recovers 8.10 chrF++ points on the unseen domain (27.11 to 35.21 chrF++).
Performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm.
The LLM acts as a robust safety net, repairing severe failures.
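The size of the recovery is easier to judge against the size of the original drop. The short calculation below uses only the three chrF++ scores reported in the abstract; the variable names are illustrative and not taken from the paper.

```python
# chrF++ scores as reported in the abstract.
in_domain = 36.17      # NMT on the New Testament (in-domain)
out_of_domain = 27.11  # same NMT model on the unseen Old Testament
hybrid = 35.21         # NMT draft refined by the LLM with RAG

# Rounding avoids floating-point noise in the last digits.
drop = round(in_domain - out_of_domain, 2)      # loss caused by domain shift
recovered = round(hybrid - out_of_domain, 2)    # gain from the hybrid system
remaining = round(in_domain - hybrid, 2)        # gap still left to in-domain
share = round(100 * recovered / drop, 1)        # percent of the drop recovered

print(f"drop={drop}, recovered={recovered}, remaining={remaining}, share={share}%")
```

On these numbers the hybrid system recovers roughly 89% of the 9.06-point drop, leaving a gap of 0.96 chrF++ to the in-domain score.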
Abstract
Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative…
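The draft-then-refine flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the character n-gram retriever is a stand-in (the abstract does not name the retrieval algorithms compared), the prompt wording is invented, and `nmt_translate`, `llm_complete`, and `memory` are hypothetical placeholders for a fine-tuned NMT model, an LLM call, and a store of parallel sentence pairs. The parameter `k` is the number of retrieved examples, which the abstract identifies as the main driver of performance.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, reference translation)


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(max(len(text) - n + 1, 1)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(source: str, memory: List[Pair], k: int) -> List[Pair]:
    """Return the k pairs whose source side is most similar to `source`.

    Stand-in retriever (assumption): any ranking function could be used here.
    """
    query = char_ngrams(source)
    ranked = sorted(memory, key=lambda p: cosine(query, char_ngrams(p[0])), reverse=True)
    return ranked[:k]


def build_prompt(source: str, draft: str, examples: List[Pair]) -> str:
    """Assemble a refinement prompt; the wording is illustrative only."""
    lines = ["Refine the draft translation using the parallel examples."]
    for src, tgt in examples:
        lines.append(f"Example source: {src}")
        lines.append(f"Example translation: {tgt}")
    lines.append(f"Source: {source}")
    lines.append(f"Draft: {draft}")
    lines.append("Refined translation:")
    return "\n".join(lines)


def translate(
    source: str,
    nmt_translate: Callable[[str], str],  # hypothetical: fine-tuned NMT model
    llm_complete: Callable[[str], str],   # hypothetical: LLM completion call
    memory: List[Pair],
    k: int = 5,
) -> str:
    """NMT produces a draft; the LLM refines it with k retrieved examples."""
    draft = nmt_translate(source)
    examples = retrieve(source, memory, k)
    return llm_complete(build_prompt(source, draft, examples)).strip()
```

Because the model and LLM are passed in as callables, the effect of retrieval volume can be probed by varying only `k` while holding the retriever fixed, which mirrors the comparison the abstract describes.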
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Materials Science
