Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair

Assaf Siani; Anna Kernerman; Ilan Kernerman

arXiv:2603.11743·cs.CL·March 13, 2026

Open Access

TL;DR

This paper develops a semi-synthetic dataset for English-Hebrew translation quality estimation, addressing data scarcity and linguistic complexity, and evaluates neural QE models trained on this dataset.

Contribution

It introduces a novel semi-synthetic dataset for under-resourced language pair QE, combining manual evaluation, error simulation, and multiple MT outputs for improved model training.

Findings

01 Dataset size and error distribution significantly affect model performance

02 Neural models like BERT and XLM-R can be trained effectively on semi-synthetic data

03 Addressing linguistic challenges improves QE accuracy for morphology-rich languages

Abstract

Quality estimation (QE) plays a crucial role in machine translation (MT) workflows, as it serves to evaluate generated outputs that have no reference translations and to determine whether human post-editing or full retranslation is necessary. Yet developing highly accurate, adaptable and reliable QE systems for under-resourced language pairs remains a largely unsolved problem, mainly because of limited parallel corpora and diverse language-dependent factors, such as those posed by morphosyntactically complex languages. This study presents a semi-synthetic parallel dataset for English-to-Hebrew QE, generated by creating English sentences based on examples of usage that illustrate typical linguistic patterns, translating them to Hebrew using multiple MT engines, and filtering outputs via BLEU-based selection. Each translated segment was manually evaluated and scored by a linguist, and we also incorporated…
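The abstract does not say how the BLEU-based selection is computed. As a rough illustration only, the sketch below assumes one possible reading: since no reference translation exists, each engine's output is scored by its mean sentence-level BLEU against the other engines' outputs, and the most "agreed-upon" candidate is kept. The function names, the engine names, the add-one smoothing, and the whitespace tokenization are all assumptions made for this example and are not taken from the paper; whitespace tokenization in particular is a simplification for a morphologically rich language such as Hebrew.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Add-one smoothed sentence-level BLEU on whitespace tokens (assumed variant)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision += math.log((matches + 1) / (total + 1)) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)


def select_by_consensus(candidates):
    """Return (best_engine, scores), scoring each output by mean BLEU against the others."""
    if len(candidates) < 2:
        raise ValueError("need outputs from at least two engines")
    scores = {}
    for name, text in candidates.items():
        others = [t for other, t in candidates.items() if other != name]
        scores[name] = sum(sentence_bleu(text, o) for o in others) / len(others)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical engine outputs; placeholder English tokens stand in for Hebrew text.
candidates = {
    "engine_a": "the cat sat on the mat",
    "engine_b": "the cat sat on a mat",
    "engine_c": "a dog stood near the door",
}
best, scores = select_by_consensus(candidates)
print(best, {name: round(score, 3) for name, score in scores.items()})
```

A consensus score of this kind only measures agreement between engines, so it cannot replace the manual scoring by a linguist that the abstract describes; it would at most serve as a pre-filter.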

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling