How Much Data is Enough Data? Fine-Tuning Large Language Models for In-House Translation: Performance Evaluation Across Multiple Dataset Sizes
Inacio Vieira, Will Allred, Séamus Lankford, Sheila Castilho, Andy Way

TL;DR
This study evaluates how varying dataset sizes impact the performance of fine-tuned Llama 3 8B models for organisation-specific translation tasks, demonstrating that larger datasets improve translation quality across multiple languages.
Contribution
It provides empirical evidence on the relationship between dataset size and translation quality when fine-tuning LLMs with translation memories for domain-specific translation.
Findings
Larger datasets lead to significant improvements in BLEU and COMET scores.
Fine-tuning with only 1k or 2k examples can decrease performance relative to the baseline model.
Integrating TMs with LLMs can create effective, domain-specific translation models.
Abstract
Decoder-only large language models (LLMs) have shown impressive performance in machine translation (MT) due to their ability to learn from extensive datasets and generate high-quality translations. However, LLMs often struggle with the nuances and style required for organisation-specific translation. In this study, we explore the effectiveness of fine-tuning LLMs, particularly Llama 3 8B Instruct, leveraging translation memories (TMs) as a valuable resource to enhance accuracy and efficiency. We investigate the impact of fine-tuning the Llama 3 model using TMs from a specific organisation in the software sector. Our experiments cover five translation directions across languages of varying resource levels (English to Brazilian Portuguese, Czech, German, Finnish, and Korean). We analyse diverse sizes of training datasets (1k to 207k segments) to evaluate their influence on translation quality. We fine-tune…
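The dataset-size comparison described in the abstract can be illustrated with a minimal sketch. It assumes the TM is available as a list of (source, target) segment pairs, that the training subsets are nested (each smaller set is contained in the larger ones), and that each segment is turned into a simple instruction-style prompt. The function names `nested_subsets` and `to_example`, the prompt wording, and the nesting strategy are all hypothetical illustrations, not the authors' documented pipeline.

```python
import random
from typing import Dict, List, Sequence, Tuple

# A translation-memory segment: (source text, target text).
Segment = Tuple[str, str]


def nested_subsets(
    segments: Sequence[Segment], sizes: Sequence[int], seed: int = 0
) -> Dict[int, List[Segment]]:
    """Shuffle the TM once, then take prefixes of increasing length.

    Assumption (not from the paper): subsets are nested, so a difference
    between two runs reflects added data rather than different samples.
    """
    shuffled = list(segments)
    random.Random(seed).shuffle(shuffled)
    subsets: Dict[int, List[Segment]] = {}
    for size in sorted(sizes):
        if size > len(shuffled):
            raise ValueError(f"requested {size} segments, only {len(shuffled)} available")
        subsets[size] = shuffled[:size]
    return subsets


def to_example(source: str, target: str, target_language: str) -> Dict[str, str]:
    """Format one segment as a prompt/completion pair (hypothetical prompt wording)."""
    prompt = f"Translate the following text from English to {target_language}.\n{source}"
    return {"prompt": prompt, "completion": target}


if __name__ == "__main__":
    # Toy TM standing in for the organisation's data.
    tm = [(f"source segment {i}", f"target segment {i}") for i in range(10)]
    # Toy sizes only; the study spans 1k to 207k segments.
    for size, subset in nested_subsets(tm, sizes=[2, 5, 10]).items():
        examples = [to_example(src, tgt, "German") for src, tgt in subset]
        print(size, len(examples))
```

Each subset would then be used for a separate fine-tuning run and scored on the same held-out test set, so that changes in BLEU or COMET can be attributed to training-set size.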
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Sparse Evolutionary Training · LLaMA
