Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair
Assaf Siani, Anna Kernerman, Ilan Kernerman

TL;DR
This paper develops a semi-synthetic dataset for English-Hebrew translation quality estimation, addressing data scarcity and linguistic complexity, and evaluates neural QE models trained on this dataset.
Contribution
It introduces a novel semi-synthetic dataset for under-resourced language pair QE, combining manual evaluation, error simulation, and multiple MT outputs for improved model training.
Findings
Dataset size and error distribution significantly affect model performance
Neural models like BERT and XLM-R can be trained effectively on semi-synthetic data
Addressing linguistic challenges improves QE accuracy for morphologically rich languages
Abstract
Quality estimation (QE) plays a crucial role in machine translation (MT) workflows, as it serves to evaluate generated outputs that have no reference translations and to determine whether human post-editing or full retranslation is necessary. Yet developing highly accurate, adaptable and reliable QE systems for under-resourced language pairs remains a largely unsolved problem, due mainly to limited parallel corpora and to diverse language-dependent factors, such as morphosyntactic complexity. This study presents a semi-synthetic parallel dataset for English-to-Hebrew QE, generated by creating English sentences based on examples of usage that illustrate typical linguistic patterns, translating them to Hebrew using multiple MT engines, and filtering outputs via BLEU-based selection. Each translated segment was manually evaluated and scored by a linguist, and we also incorporated…
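The abstract does not say how the BLEU-based selection is computed. As one possible reading, the sketch below scores each engine's output against the other engines' outputs (used as pseudo-references) and keeps the one with the highest mean score. The function names `sentence_bleu` and `select_by_consensus`, the whitespace tokenization, and the add-one smoothing are all assumptions made for illustration; this is not the authors' implementation, and the simplified BLEU here is not a standard reference implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp, ref, max_n=4):
    """Simplified, smoothed sentence-level BLEU against one reference.

    Hypothetical helper: whitespace tokens, add-one smoothing on every
    n-gram order, standard brevity penalty.
    """
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h_ng, r_ng = ngrams(h, n), ngrams(r, n)
        overlap = sum((h_ng & r_ng).values())  # clipped n-gram matches
        total = sum(h_ng.values())
        # Add-one smoothing so an unmatched order does not zero the score.
        log_prec += math.log((overlap + 1) / (total + 1))
    bp = 1.0 if len(h) >= len(r) else math.exp(1 - len(r) / len(h))
    return bp * math.exp(log_prec / max_n)


def select_by_consensus(candidates):
    """Pick the engine whose output has the highest mean BLEU
    against the other engines' outputs (assumed selection rule).

    candidates: dict mapping engine name -> translation string.
    Returns (engine, translation, mean_score).
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate translations")
    best = None
    for name, hyp in candidates.items():
        others = [t for k, t in candidates.items() if k != name]
        score = sum(sentence_bleu(hyp, o) for o in others) / len(others)
        if best is None or score > best[2]:
            best = (name, hyp, score)
    return best
```

Calling `select_by_consensus({"engine_a": ..., "engine_b": ..., "engine_c": ...})` returns the chosen engine, its translation, and its mean score. For Hebrew, whitespace tokenization is a coarse choice because prefixes and clitics attach to the word, so a character-level or morphologically aware metric may be more appropriate in practice.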
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
