Toward domain-specific machine translation and quality estimation systems

Javad Pourmostafa Roshan Sharami

arXiv:2603.24955·cs.CL·March 27, 2026

Open Access

TL;DR

This dissertation presents data-focused methods for adapting machine translation (MT) and quality estimation (QE) systems to specialized domains, improving quality and efficiency through targeted data selection, staged domain-adaptive training, and QE-guided in-context learning.

Contribution

It introduces novel domain adaptation techniques for MT and QE, including similarity-based data selection, staged training pipelines, and QE-guided in-context learning for large language models.

Findings

01 Targeted data selection improves translation quality with less data.

02 Domain adaptation strategies enhance performance across languages and resource settings.

03 QE-guided in-context learning outperforms retrieval-based methods and reduces reference dependence.

Abstract

Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched…
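As a rough illustration of what similarity-based data selection can look like, the sketch below ranks sentences from a generic pool by their cosine similarity to a small in-domain seed set and keeps the top k. This is not the dissertation's method: the bag-of-words representation, the centroid comparison, and the names `bow`, `cosine`, and `select_in_domain` are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter


def bow(sentence):
    """Lowercased bag-of-words counts for one sentence (illustrative representation)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_in_domain(generic, in_domain_seed, k):
    """Return the k generic sentences most similar to the in-domain seed set."""
    # Hypothetical choice: summarize the seed set as one summed count vector.
    centroid = Counter()
    for s in in_domain_seed:
        centroid.update(bow(s))
    ranked = sorted(generic, key=lambda s: cosine(bow(s), centroid), reverse=True)
    return ranked[:k]


# Toy data, invented for this example only.
seed = [
    "the patient received a dose of insulin",
    "dose adjustment for renal patients",
]
pool = [
    "the football match ended in a draw",
    "the patient was given a lower dose",
    "stock prices fell sharply today",
    "insulin dose depends on the patient",
]
selected = select_in_domain(pool, seed, k=2)
print(selected)
```

In practice a sentence-embedding model would likely replace the bag-of-words vectors; the ranking-and-truncation step would stay the same.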

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification