Dynamic Jointly Batch Selection for Data Efficient Machine Translation Fine-Tuning
Mohammad Amin Ghanizadeh, Mohammad Javad Dousti

TL;DR
This paper introduces a data selection method for machine translation fine-tuning that combines a learnability score with a batch selection strategy to improve data efficiency and translation quality while reducing computational cost.
Contribution
It proposes a novel data selection approach combining learnability scores with batch interdependency considerations for more effective MT fine-tuning.
Findings
Achieves up to a fivefold improvement in data efficiency over an iid baseline.
Reduces training data requirements by 24% with cached embeddings.
Enhances translation performance compared to random data selection.
Abstract
Data quality and its effective selection are fundamental to improving the performance of machine translation models, serving as cornerstones for achieving robust and reliable translation systems. This paper presents a data selection methodology specifically designed for fine-tuning machine translation systems, which leverages the synergy between a learner model and a pre-trained reference model to enhance overall training effectiveness. By defining a learnability score, our approach systematically evaluates the utility of data points for training, ensuring that only the most relevant and impactful examples contribute to the fine-tuning process. Furthermore, our method employs a batch selection strategy which considers interdependencies among data points, optimizing the efficiency of the training process while maintaining a focus on data relevance. Experiments on English to Persian and…
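The scoring-and-selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so this sketch assumes the learnability score is the learner model's loss minus the reference model's loss on each example, and it picks the top-scoring examples independently. It does not model the interdependencies among data points that the paper's batch strategy accounts for. The function names and the loss values are hypothetical.

```python
from typing import List, Sequence


def learnability_scores(learner_losses: Sequence[float],
                        reference_losses: Sequence[float]) -> List[float]:
    """Assumed score (not taken from the paper): learner loss minus reference loss."""
    if len(learner_losses) != len(reference_losses):
        raise ValueError("loss sequences must have the same length")
    return [l - r for l, r in zip(learner_losses, reference_losses)]


def select_batch(scores: Sequence[float], batch_size: int) -> List[int]:
    """Return the indices of the batch_size highest-scoring examples, in ascending order."""
    if not 0 < batch_size <= len(scores):
        raise ValueError("batch_size must be between 1 and len(scores)")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:batch_size])


# Hypothetical per-example losses for a candidate pool of six sentence pairs.
learner = [2.5, 0.5, 3.0, 1.0, 4.0, 2.0]
reference = [1.0, 0.25, 2.75, 1.5, 1.0, 2.0]

scores = learnability_scores(learner, reference)
chosen = select_batch(scores, batch_size=2)
print(scores)  # [1.5, 0.25, 0.25, -0.5, 3.0, 0.0]
print(chosen)  # [0, 4]
```

Under this assumption, examples that the learner still finds hard but the reference model finds easy score highest, while examples that are hard for both models (likely noisy) or easy for both (already learned) score near zero.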
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Data Classification
