TL;DR
This paper investigates how pretrained language models use contaminated data (downstream test sets that leak into pretraining corpora), distinguishes memorizing that data from exploiting it on downstream tasks, and argues that analyzing pretraining data is necessary to tell genuine language understanding apart from contamination effects.
Contribution
The paper introduces a method to quantify and differentiate between memorization and exploitation of contaminated data in pretrained language models.
Findings
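Exploitation occurs in some cases; in others, models memorize the contaminated data without exploiting it.
Both memorization and exploitation are affected by the model size and by the number of times the contaminated data is duplicated.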
Analyzing data contamination is crucial for genuine NLP progress.
Abstract
Pretrained language models are typically trained on massive web-based datasets, which are often "contaminated" with downstream test sets. It is not clear to what extent models exploit the contaminated data for downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify levels of memorization and exploitation. Experiments with two models and three downstream tasks show that exploitation exists in some cases, but in others the models memorize the contaminated data, but do not exploit it. We show that these two measures are affected by different factors such as the number of duplications of the contaminated data and the model size. Our results…
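The seen-versus-unseen comparison described in the abstract can be illustrated with a small sketch. It assumes that a measure is simply the accuracy gap between samples that appeared in the pretraining corpus and samples that did not; the abstract does not give the exact definitions, so the function names and the choice of accuracy as the metric are hypothetical and not taken from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(labels) or len(labels) == 0:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def seen_unseen_gap(
    seen_preds: Sequence[int],
    seen_labels: Sequence[int],
    unseen_preds: Sequence[int],
    unseen_labels: Sequence[int],
) -> float:
    """Hypothetical measure: accuracy on seen samples minus accuracy on unseen ones.

    Assumption (not from the paper): applied to the pretrained model's own
    label predictions, the gap stands in for memorization; applied to the
    fine-tuned model's task predictions, it stands in for exploitation.
    """
    return accuracy(seen_preds, seen_labels) - accuracy(unseen_preds, unseen_labels)


# Toy data, made up for illustration only.
gold = [1, 0, 1, 1]
gap = seen_unseen_gap(
    seen_preds=[1, 0, 1, 1], seen_labels=gold,
    unseen_preds=[1, 0, 0, 1], unseen_labels=gold,
)
```

Under this assumption, a gap near zero would suggest that having seen a sample during pretraining gave no advantage, while a clearly positive gap would suggest that it did.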
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Attention Dropout · Weight Decay · WordPiece · Dropout · Layer Normalization
