Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG

David Samuel Setiawan; Raphaël Merx; Jey Han Lau

arXiv:2601.09982·cs.CL·February 17, 2026



TL;DR

This paper addresses domain shift in low-resource neural machine translation (NMT) with a hybrid framework in which an LLM refines NMT drafts using Retrieval-Augmented Generation (RAG), recovering 8.10 chrF++ points on an unseen domain.

Contribution

The paper proposes a hybrid NMT + RAG framework for low-resource translation under domain shift and shows that the number of retrieved examples matters more than the choice of retrieval algorithm.

Findings

01 The RAG-based refinement recovers 8.10 chrF++ points on the unseen Old Testament domain (27.11 to 35.21 chrF++).

02 Performance depends primarily on the number of retrieved examples, not on the choice of retrieval algorithm.

03 The LLM acts as a robust safety net, repairing severe failures.

Abstract

Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative…
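
The draft-then-refine flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the token-overlap scorer, the function names (`overlap_score`, `retrieve`, `build_refinement_prompt`), the prompt wording, and the placeholder strings are all hypothetical. The parameter `k` stands for the number of retrieved examples, the factor the abstract identifies as the main driver of performance. The NMT model and the LLM call are deliberately left out.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, target translation)


def overlap_score(query: str, candidate: str) -> float:
    """Fraction of query tokens also found in the candidate.

    Hypothetical stand-in for any retriever; the paper's retrievers are not shown here.
    """
    q = Counter(query.lower().split())
    c = Counter(candidate.lower().split())
    shared = sum((q & c).values())
    return shared / max(1, sum(q.values()))


def retrieve(source: str, memory: List[Pair], k: int) -> List[Pair]:
    """Return up to k parallel examples whose source side best matches `source`."""
    ranked = sorted(memory, key=lambda p: overlap_score(source, p[0]), reverse=True)
    return ranked[:k]


def build_refinement_prompt(source: str, draft: str, examples: List[Pair]) -> str:
    """Assemble an illustrative prompt: retrieved examples, then the NMT draft to refine."""
    lines = ["Refine the draft translation using the examples.", ""]
    for src, tgt in examples:
        lines.append(f"Source: {src}")
        lines.append(f"Translation: {tgt}")
        lines.append("")
    lines += [f"Source: {source}", f"Draft: {draft}", "Refined translation:"]
    return "\n".join(lines)


# Placeholder data: target sides are dummy strings, not real Dhao text.
memory = [
    ("in the beginning god created the heavens", "<target 1>"),
    ("the lord is my shepherd", "<target 2>"),
    ("god said let there be light", "<target 3>"),
]
source = "god created the light"
examples = retrieve(source, memory, k=2)
prompt = build_refinement_prompt(source, "<nmt draft>", examples)
print(prompt)
```

In a setup like this, raising `k` only changes how many example pairs reach the prompt, so the effect of context volume can be varied independently of the scoring function used for ranking.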


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Materials Science