A Taxonomy for Data Contamination in Large Language Models

Medha Palavalli; Amanda Bertsch; Matthew R. Gormley

arXiv:2407.08716 · cs.CL · July 12, 2024 · 1 citation


Open Access

TL;DR

This paper introduces a taxonomy of data contamination types in large language models and analyzes how each type affects performance on two downstream tasks, summarization and question answering.

Contribution

The paper proposes a taxonomy of data contamination in LLMs and analyzes how different contamination types affect model performance on summarization and question answering.

Findings

01. Certain contamination types significantly inflate model performance.

02. Contamination from test set variants can evade detection.

03. Impact varies across different NLP tasks.

Abstract

Large language models pretrained on extensive web corpora demonstrate remarkable performance across a wide range of downstream tasks. However, a growing concern is data contamination, where evaluation datasets may be contained in the pretraining corpus, inflating model performance. Decontamination, the process of detecting and removing such data, is a potential solution; yet these contaminants may originate from altered versions of the test set, evading detection during decontamination. How different types of contamination impact the performance of language models on downstream tasks is not fully understood. We present a taxonomy that categorizes the various types of contamination encountered by LLMs during the pretraining phase and identify which types pose the highest risk. We analyze the impact of contamination on two key NLP tasks -- summarization and question answering -- revealing…
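The abstract's point that altered versions of a test set can evade decontamination can be illustrated with a small sketch. The code below is not the paper's method. It assumes a simple word n-gram overlap check as the detector, and the names `ngrams` and `is_flagged`, the window size `n=5`, and the example strings are all hypothetical choices made for illustration.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_flagged(document: str, test_example: str, n: int = 5) -> bool:
    """Assumed detector: flag a document that shares any n-gram with a test example."""
    return bool(ngrams(document, n) & ngrams(test_example, n))


# Hypothetical evaluation item and two hypothetical pretraining documents.
test_example = "What is the capital of France? The capital of France is Paris."
verbatim_doc = "Quiz dump: What is the capital of France? The capital of France is Paris."
altered_doc = "Quiz dump: Which city serves as France's capital? Paris."

print(is_flagged(verbatim_doc, test_example))  # verbatim copy shares n-grams
print(is_flagged(altered_doc, test_example))   # reworded copy shares none
```

Under these assumptions the verbatim copy is flagged and the reworded copy is not, even though both carry the same question and answer. Lowering `n` would catch more altered copies at the cost of flagging unrelated text that happens to share short phrases.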


Taxonomy

Topics: Privacy-Preserving Technologies in Data · Data Quality and Management · Digital and Cyber Forensics