Data Contamination: From Memorization to Exploitation

Inbal Magar; Roy Schwartz

arXiv:2203.08242·cs.CL·March 17, 2022

TL;DR

This paper investigates how pretrained language models use downstream test data that has contaminated their pretraining corpora. It distinguishes memorizing that data from exploiting it on downstream tasks, and argues that analyzing pretraining data is necessary to tell genuine language understanding apart from contamination effects.

Contribution

The paper introduces a method to quantify and differentiate between memorization and exploitation of contaminated data in pretrained language models.

Findings

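01. Models exploit contaminated data in some cases; in others they memorize it without exploiting it.

02. Both memorization and exploitation depend on factors such as model size and the number of times the contaminated data is duplicated.

03. Analyzing data contamination is crucial for genuine NLP progress.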

Abstract

Pretrained language models are typically trained on massive web-based datasets, which are often "contaminated" with downstream test sets. It is not clear to what extent models exploit the contaminated data for downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify levels of memorization and exploitation. Experiments with two models and three downstream tasks show that exploitation exists in some cases, but in others the models memorize the contaminated data, but do not exploit it. We show that these two measures are affected by different factors such as the number of duplications of the contaminated data and the model size. Our results…
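The seen-versus-unseen comparison described in the abstract can be illustrated with a small sketch. The excerpt here does not give the exact definitions of the memorization and exploitation measures, so the code below assumes that a measure is an accuracy gap (accuracy on samples seen during pretraining minus accuracy on unseen samples). The function names and the toy predictions are hypothetical and are not taken from the paper's repository.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def seen_unseen_gap(
    seen_preds: Sequence[int],
    seen_labels: Sequence[int],
    unseen_preds: Sequence[int],
    unseen_labels: Sequence[int],
) -> float:
    """Assumed measure: accuracy on seen samples minus accuracy on unseen samples.

    A gap near zero suggests the model gains nothing from having seen the
    samples during pretraining; a large positive gap suggests it does.
    """
    return accuracy(seen_preds, seen_labels) - accuracy(unseen_preds, unseen_labels)


# Toy, made-up predictions from a fine-tuned classifier (illustration only).
gold = [1, 0, 1, 1]
preds_on_seen = [1, 0, 1, 1]    # all four correct
preds_on_unseen = [1, 0, 0, 1]  # three of four correct

gap = seen_unseen_gap(preds_on_seen, gold, preds_on_unseen, gold)
print(f"seen-minus-unseen accuracy gap: {gap:.2f}")
```

Under this assumption, the same gap function could be applied twice: once to predictions of the masked label during pretraining-style evaluation (a memorization-style measure) and once to predictions after fine-tuning (an exploitation-style measure). That pairing is one reading of the abstract rather than a confirmed description of the authors' implementation.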

Code & Models

Repositories

schwartz-lab-nlp/data_contamination (PyTorch, official)

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Attention Dropout · Weight Decay · WordPiece · Dropout · Layer Normalization