Translation or Recitation? Calibrating Evaluation Scores for Machine Translation of Extremely Low-Resource Languages

Danlu Chen; Ka Sing He; Jiahe Tian; Chenghao Xiao; Zhaofeng Wu; Taylor Berg-Kirkpatrick; Freda Shi

arXiv:2603.25222·cs.CL·March 27, 2026

Open Access

TL;DR

This paper introduces the FRED Difficulty Metrics to contextualize and interpret performance scores in extremely low-resource machine translation, addressing variability caused by dataset and pre-training factors.

Contribution

The paper proposes dataset-intrinsic metrics to better understand and compare low-resource MT results, highlighting factors like train-test overlap and tokenization issues.

Findings

01. A significant share of the variability in reported results is explained by train-test overlap and pre-training exposure.

02. Extinct and indigenous languages face tokenization challenges that affect translation quality.

03. Reporting these metrics alongside scores improves transparency and reliability in low-resource MT evaluation.

Abstract

The landscape of extremely low-resource machine translation (MT) is characterized by perplexing variability in reported performance, often making results across different language pairs difficult to contextualize. For researchers focused on specific language groups -- such as ancient languages -- it is nearly impossible to determine if breakthroughs reported in other contexts (e.g., native African or American languages) result from superior methodologies or are merely artifacts of benchmark collection. To address this problem, we introduce the FRED Difficulty Metrics, which include the Fertility Ratio (F), Retrieval Proxy (R), Pre-training Exposure (E), and Corpus Diversity (D) and serve as dataset-intrinsic metrics to contextualize reported scores. These metrics reveal that a significant portion of result variability is explained by train-test overlap and pre-training exposure rather…
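To make the first two metrics concrete, the sketch below shows one plausible way to compute a tokenizer fertility ratio and a retrieval-style "recitation" baseline. The abstract does not give the paper's exact definitions, so everything here is an assumption for illustration: the function names, the toy 2-character tokenizer, the Jaccard word-overlap similarity, and the exact-match overlap rate are all hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, List, Tuple


def fertility_ratio(sentences: List[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    n_words = sum(len(s.split()) for s in sentences)
    n_tokens = sum(len(tokenize(s)) for s in sentences)
    return n_tokens / n_words if n_words else 0.0


def char_bigram_tokenize(text: str) -> List[str]:
    """Toy stand-in for a subword tokenizer: cut each word into 2-character chunks."""
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two sentences."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def retrieve_translation(src: str, train_pairs: List[Tuple[str, str]]) -> str:
    """Recitation baseline: return the target of the most similar training source."""
    _, best_tgt = max(train_pairs, key=lambda pair: jaccard(src, pair[0]))
    return best_tgt


def exact_overlap_rate(test_sources: List[str], train_sources: List[str]) -> float:
    """Fraction of test source sentences that appear verbatim in the training set."""
    train = set(train_sources)
    if not test_sources:
        return 0.0
    return sum(s in train for s in test_sources) / len(test_sources)


if __name__ == "__main__":
    train = [("the cat sits", "A"), ("a dog runs", "B")]
    print(fertility_ratio(["abcd ef"], char_bigram_tokenize))
    print(retrieve_translation("the cat sleeps", train))
    print(exact_overlap_rate(["the cat sits", "a bird flies"],
                             [src for src, _ in train]))
```

Under these assumptions, a high fertility ratio would indicate that the tokenizer fragments the language heavily, and a retrieval baseline that scores close to a trained system would suggest the reported score reflects train-test overlap more than translation ability.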

Taxonomy

Topics: Natural Language Processing Techniques · Language and cultural evolution · Translation Studies and Practices