Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning

Mohammad Amin Ghanizadeh; Mohammad Javad Dousti

arXiv:2511.04406·cs.CL·November 7, 2025

TL;DR

This paper introduces a data selection method for machine translation fine-tuning that uses a learnability score and a joint batch selection strategy to improve data efficiency, reduce computational cost, and raise translation quality.

Contribution

The paper proposes a data selection approach that combines learnability scores with the interdependencies among data points in a batch for more effective MT fine-tuning.

Findings

01

Achieves up to a fivefold improvement in data efficiency over an i.i.d. baseline.

02

Reduces training data requirements by 24% with cached embeddings.

03

Enhances translation performance compared to random data selection.

Abstract

Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to enhance overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy which considers interdependencies among data points, optimizing the efficiency of the training process while maintaining a focus on data relevance. Experiments on English to Persian and…
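The abstract does not state how the learnability score is computed. As a minimal sketch, assume one common formulation from learnability-based selection work: the learner's per-example loss minus the reference model's per-example loss. The function names, the loss values, and the top-k rule below are all hypothetical and are not taken from the paper. The sketch scores each example independently, so it does not model the interdependencies among data points that the paper's batch selection strategy considers.

```python
import numpy as np

def learnability_scores(learner_loss, reference_loss):
    """Assumed score: learner loss minus reference loss, per example.

    A high score means the learner still finds the example hard while the
    pre-trained reference model finds it easy, i.e. it looks learnable.
    """
    learner = np.asarray(learner_loss, dtype=float)
    reference = np.asarray(reference_loss, dtype=float)
    return learner - reference

def select_sub_batch(learner_loss, reference_loss, k):
    """Return indices of the k highest-scoring examples, best first."""
    scores = learnability_scores(learner_loss, reference_loss)
    order = np.argsort(scores)[::-1]  # argsort is ascending, so reverse it
    return order[:k], scores

# Hypothetical per-example losses for a candidate pool of four sentence pairs.
learner_loss = [2.0, 0.5, 3.0, 1.0]
reference_loss = [0.5, 0.4, 2.9, 0.2]

chosen, scores = select_sub_batch(learner_loss, reference_loss, k=2)
print(chosen.tolist())
```

In this toy pool, example 2 has the highest learner loss but is nearly as hard for the reference model, so it scores low; that is the intuition behind subtracting the reference loss rather than ranking by learner loss alone.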

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification