Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?

Aaditya K. Singh; Muhammed Yusuf Kocyigit; Andrew Poulton; David Esiobu; Maria Lomeli; Gergely Szilvasy; Dieuwke Hupkes

arXiv:2411.03923·cs.CL·November 7, 2024·2 cites

Open Access

TL;DR

This paper introduces ConTAM, a novel method for analyzing evaluation data contamination in large language models, revealing its significant impact on benchmark scores and providing insights for more accurate assessments.

Contribution

The paper proposes ConTAM, a new analysis technique for evaluation data contamination, and offers a comprehensive survey of contamination metrics across multiple benchmarks and models.

Findings

1. Contamination can significantly inflate benchmark scores.
2. Considering only the longest contaminated substring improves detection.
3. Hyperparameter choices greatly influence contamination measurement accuracy.
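To make the second finding concrete, here is a minimal sketch contrasting two n-gram based scores for one evaluation sample: the fraction of tokens covered by any matched n-gram, and the length of the longest contiguous matched run. The function names, the whitespace tokenization, and the choice of n = 3 are assumptions for illustration only; they are not taken from the paper, whose exact metric definitions are not given on this page.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_mask(sample_tokens, corpus_ngrams, n):
    """Mark every sample token that lies inside an n-gram also seen in the corpus."""
    mask = [False] * len(sample_tokens)
    for i in range(len(sample_tokens) - n + 1):
        if tuple(sample_tokens[i:i + n]) in corpus_ngrams:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def token_overlap(mask):
    """Fraction of sample tokens covered by any matched n-gram."""
    return sum(mask) / len(mask) if mask else 0.0


def longest_match(mask):
    """Length of the longest contiguous matched run, as a fraction of the sample."""
    best = run = 0
    for matched in mask:
        run = run + 1 if matched else 0
        best = max(best, run)
    return best / len(mask) if mask else 0.0


# Toy data (hypothetical): a tiny "pretraining corpus" and one evaluation sample.
corpus = "the cat sat on the mat".split()
sample = "cat sat on my on the mat".split()
n = 3  # assumed n-gram size

mask = contaminated_mask(sample, ngrams(corpus, n), n)
overlap_score = token_overlap(mask)
longest_score = longest_match(mask)
```

In this toy case two separate matches cover six of the seven tokens, while the longest single run covers only three. A score based on the longest run discounts scattered partial matches, which is one plausible reading of why it might flag contaminated samples more selectively.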

Abstract

Hampering the interpretation of benchmark scores, evaluation data contamination has become a growing concern in the evaluation of LLMs, and an active area of research studies its effects. While evaluation data contamination is easily understood intuitively, it is surprisingly difficult to define precisely which samples should be considered contaminated and, consequently, how it impacts benchmark scores. We propose that these questions should be addressed together and that contamination metrics can be assessed based on whether models benefit from the examples they mark contaminated. We propose a novel analysis method called ConTAM, and show with a large scale survey of existing and novel n-gram based contamination metrics across 13 benchmarks and 7 models from 2 different families that ConTAM can be used to better understand evaluation data contamination and its effects. We find that…
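The idea of judging a contamination metric by whether models benefit from the examples it marks contaminated can be sketched roughly as follows. This is an illustrative assumption, not the paper's definition of ConTAM: the names `gain_at_threshold` and `threshold`, and the choice to compare accuracy on all samples against accuracy on the samples marked clean, are hypothetical.

```python
def accuracy(flags):
    """Mean of 0/1 correctness flags; NaN when the list is empty."""
    return sum(flags) / len(flags) if flags else float("nan")


def gain_at_threshold(scores, correct, threshold):
    """Compare accuracy on the full benchmark with accuracy on the clean subset.

    scores:    per-sample contamination scores in [0, 1] from some metric
    correct:   per-sample 0/1 flags for whether the model answered correctly
    threshold: samples with score >= threshold are treated as contaminated

    Returns (gain, contaminated_fraction). A positive gain means the model
    scores higher when the flagged samples are included.
    """
    clean = [c for s, c in zip(scores, correct) if s < threshold]
    gain = accuracy(correct) - accuracy(clean)
    contaminated_fraction = 1 - len(clean) / len(scores)
    return gain, contaminated_fraction


# Toy data (hypothetical): four samples, two of them heavily overlapping.
scores = [0.0, 0.1, 0.8, 0.9]
correct = [0, 1, 1, 1]

gain, frac = gain_at_threshold(scores, correct, threshold=0.5)
```

Sweeping `threshold` over a range and plotting the gain against the contaminated fraction gives one way to compare metrics and their hyperparameters: a metric that flags few samples yet accounts for a large gain is isolating the examples the model actually benefits from.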

Taxonomy

Topics: Library Science and Information Systems