Improving Translation Quality by Selecting Better Data for LLM Fine-Tuning: A Comparative Analysis

Felipe Ribeiro Fujita de Mello; Hideyuki Takada

arXiv:2512.11388·cs.CL·December 15, 2025

Open Access

TL;DR

This paper evaluates how different data selection methods affect machine translation fine-tuning for open LLMs, highlighting the importance of semantic data quality for improving translation performance.

Contribution

It provides a comparative analysis of five data selectors, demonstrating the superiority of semantic selectors over lexical and geometry-based heuristics in translation quality.

Findings

01. Semantic selectors outperform other heuristics.

02. Small differences in selected data significantly impact performance.

03. Data quality critically influences fine-tuning outcomes.

Abstract

We investigate the impact of data selection on machine translation fine-tuning for open LLMs. Using Japanese-English corpora, we compare five selectors (TF-IDF, COMET Kiwi, QuRate, FD-Score, and random selection) under controlled training conditions. We observe that semantic selectors consistently outperform lexical and geometry-based heuristics, and that even when the selected data differ by less than 3%, the impact on model performance is substantial, underscoring the sensitivity of fine-tuning to data quality.
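
As a rough illustration of the setup the abstract describes, the sketch below selects a fixed-size training subset from per-example scores and measures how much two selected subsets differ. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names (`select_top_k`, `select_random`, `fraction_differing`), the toy scores, and the choice of "share of examples not shared" as the difference measure are all hypothetical, since this page does not say how the selectors were applied or how the sub-3% difference was computed.

```python
import random


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring examples.

    Ties are broken by the lower index so the result is deterministic.
    """
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(ranked[:k])


def select_random(n, k, seed=0):
    """Random-selection baseline: k distinct indices out of n examples."""
    rng = random.Random(seed)
    return set(rng.sample(range(n), k))


def fraction_differing(a, b):
    """Share of subset `a` that does not appear in subset `b`.

    Assumes both subsets were drawn with the same budget k.
    """
    return len(a - b) / len(a)


# Hypothetical quality scores for 8 sentence pairs from two scorers
# (stand-ins for, e.g., a semantic and a lexical selector).
scores_a = [0.91, 0.15, 0.78, 0.66, 0.42, 0.88, 0.30, 0.73]
scores_b = [0.89, 0.20, 0.80, 0.45, 0.70, 0.90, 0.25, 0.60]

budget = 4  # same training budget for every selector
sel_a = select_top_k(scores_a, budget)
sel_b = select_top_k(scores_b, budget)
sel_rand = select_random(len(scores_a), budget)

print("selector A:", sorted(sel_a))
print("selector B:", sorted(sel_b))
print("A vs B differing share:", fraction_differing(sel_a, sel_b))
print("A vs random differing share:", fraction_differing(sel_a, sel_rand))
```

Holding the budget fixed across selectors is what makes the comparison controlled: any change in downstream translation quality can then be attributed to which examples were chosen rather than to how many.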

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification