Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination's Impact on Machine Translation

Muhammed Yusuf Kocyigit; Eleftheria Briakou; Daniel Deutsch; Jiaming Luo; Colin Cherry; Markus Freitag

arXiv:2501.18771·cs.CL·February 3, 2025

TL;DR

This study systematically investigates how data contamination inflates machine translation evaluation scores in large language models. It finds that contamination with both source and target text substantially inflates BLEU, and that the inflation is 2.5 times larger at 8B than at 1B scale.

Contribution

It provides a controlled, large-scale analysis of data contamination effects on machine translation evaluation, highlighting the magnitude and factors influencing score inflation.

Findings

01

Contamination with both source and target substantially inflates BLEU scores, by up to 30 points for 8B models.

02

Inflation is 2.5 times larger for 8B models than for 1B models.

03

Source-only and target-only contamination generally produce smaller, less consistent over-estimations than combined source-and-target contamination.

Abstract

Data contamination -- the accidental consumption of evaluation examples within the pre-training data -- can undermine the validity of evaluation benchmarks. In this paper, we present a rigorous analysis of the effects of contamination on language models at 1B and 8B scales on the machine translation task. Starting from a carefully decontaminated train-test split, we systematically introduce contamination at various stages, scales, and data formats to isolate its effect and measure its impact on performance metrics. Our experiments reveal that contamination with both source and target substantially inflates BLEU scores, and this inflation is 2.5 times larger (up to 30 BLEU points) for 8B compared to 1B models. In contrast, source-only and target-only contamination generally produce smaller, less consistent over-estimations. Finally, we study how the temporal distribution and frequency of…
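The contamination setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `TestPair` type, the three mode names, the newline-joined format for combined contamination, and the `copies` parameter are all hypothetical choices made for illustration. It shows how test examples might be injected into a decontaminated corpus as source-only, target-only, or source-and-target text, and how inflation can be expressed as the score difference between a contaminated and a clean model.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TestPair:
    """One evaluation example: a source sentence and its reference translation."""
    source: str
    target: str


def contamination_text(pair: TestPair, mode: str) -> str:
    """Render a test pair as the text that leaks into pre-training data.

    The mode names and the newline-joined format for "both" are assumptions.
    """
    if mode == "source_only":
        return pair.source
    if mode == "target_only":
        return pair.target
    if mode == "both":
        return f"{pair.source}\n{pair.target}"
    raise ValueError(f"unknown contamination mode: {mode}")


def inject(corpus: Iterable[str], pairs: Iterable[TestPair],
           mode: str, copies: int = 1) -> List[str]:
    """Return a new corpus with each test pair appended `copies` times."""
    leaked = [contamination_text(p, mode) for p in pairs for _ in range(copies)]
    return list(corpus) + leaked


def inflation(contaminated_score: float, clean_score: float) -> float:
    """Over-estimation: metric of the contaminated model minus the clean baseline."""
    return contaminated_score - clean_score
```

Starting from a clean corpus and varying only `mode` and `copies` keeps every other factor fixed, which is the sense in which such a setup isolates the effect of contamination.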

Videos

Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination’s Impact on Machine Translation · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques