A Taxonomy for Data Contamination in Large Language Models
Medha Palavalli, Amanda Bertsch, Matthew R. Gormley

TL;DR
This paper introduces a taxonomy of data contamination types in large language models and analyzes how different types affect performance on downstream tasks such as summarization and question answering.
Contribution
The paper provides a novel taxonomy for data contamination in LLMs and analyzes how different contamination types affect model performance on two key NLP tasks: summarization and question answering.
Findings
Certain contamination types significantly inflate model performance.
Contamination from test set variants can evade detection.
Impact varies across different NLP tasks.
Abstract
Large language models pretrained on extensive web corpora demonstrate remarkable performance across a wide range of downstream tasks. However, a growing concern is data contamination, where evaluation datasets may be contained in the pretraining corpus, inflating model performance. Decontamination, the process of detecting and removing such data, is a potential solution; yet these contaminants may originate from altered versions of the test set, evading detection during decontamination. How different types of contamination impact the performance of language models on downstream tasks is not fully understood. We present a taxonomy that categorizes the various types of contamination encountered by LLMs during the pretraining phase and identify which types pose the highest risk. We analyze the impact of contamination on two key NLP tasks -- summarization and question answering -- revealing…
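To illustrate the abstract's point that altered versions of a test set can evade decontamination, here is a minimal sketch of an overlap-based filter. It assumes decontamination is done by word n-gram overlap, which is one common approach and not necessarily the method used in the paper. The function names, the trigram size, the 0.5 threshold, and the example strings are all hypothetical.

```python
import re


def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(test_example, corpus_doc, n=3):
    """Fraction of the test example's n-grams that also occur in the corpus document."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(corpus_doc, n)) / len(test_grams)


def is_flagged(test_example, corpus_doc, n=3, threshold=0.5):
    """Flag a corpus document as contaminated (threshold is an assumed value)."""
    return overlap_ratio(test_example, corpus_doc, n) >= threshold


# Hypothetical test item and two pretraining documents that both leak it.
test_example = "What is the capital of France? Paris"
verbatim_doc = "Quiz answers. What is the capital of France? Paris. Next question."
paraphrased_doc = "Quiz answers. France's capital city is Paris. Next question."

verbatim_flagged = is_flagged(test_example, verbatim_doc)
paraphrase_flagged = is_flagged(test_example, paraphrased_doc)
```

In this sketch the verbatim copy is flagged, while the paraphrased document, which still reveals the answer, shares no trigram with the test item and passes the filter. That gap is the kind of evasion the abstract describes.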
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Data Quality and Management · Digital and Cyber Forensics
