Measuring Fingerprints of Web-filtered Text Datasets and Fingerprint Propagation Through Training
Youssef Mansour, Reinhard Heckel

TL;DR
This paper shows that large language model pretraining datasets carry distinctive fingerprints that stem from their curation pipelines. These fingerprints can be detected by a classifier and propagate through training, with consequences for model generalization and transparency.
Contribution
It demonstrates the existence of dataset-specific fingerprints in LLM pretraining data and shows how these fingerprints propagate through training, providing new insights into dataset biases and model behavior.
Findings
Neural networks can identify which dataset a single text sequence belongs to significantly better than humans can.
Differences in filtering and processing induce detectable fingerprints.
Fingerprints affect cross-dataset generalization and can reveal training data characteristics.
Abstract
We investigate fingerprints in pretraining datasets for large language models (LLMs) through dataset classification experiments. Building on prior work demonstrating the existence of fingerprints or biases in popular computer vision datasets, we analyze popular open-source pretraining datasets for LLMs derived from CommonCrawl including C4, RefinedWeb, DolmaCC, RedPajama-V2, FineWeb, and DCLM-Baseline. Despite those datasets being obtained with similar curation steps, neural networks can classify surprisingly well which dataset a single text sequence belongs to, significantly better than a human can. This indicates that small differences in filtering and processing pipelines induce fingerprints. Those fingerprints are evident in formatting, vocabulary, and content distributions, and can negatively impact cross-dataset generalization. Additionally, we show that these fingerprints…
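The dataset classification setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the paper uses neural networks on real CommonCrawl-derived datasets, whereas this sketch swaps in a small multinomial Naive Bayes classifier written in plain Python. The labels `toy_a` and `toy_b` and all example sequences are invented here purely for illustration, and the class name `NaiveBayesDatasetClassifier` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace tokens; punctuation stays attached to words,
    # so formatting quirks remain visible as features.
    return text.lower().split()


class NaiveBayesDatasetClassifier:
    """Toy stand-in for a dataset classifier (assumption: NOT the paper's model)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing strength
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter()               # label -> number of sequences
        self.vocab = set()

    def fit(self, sequences, labels):
        for text, label in zip(sequences, labels):
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.token_counts[label]
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokenize(text):
                score += math.log(
                    (counts[tok] + self.alpha) / (total + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy sequences standing in for two differently filtered sources.
train_sequences = [
    "Click here to subscribe to our newsletter today!",
    "Subscribe now and click here for more deals!",
    "The study reports results on protein folding.",
    "Results of the study suggest a new protein model.",
]
train_labels = ["toy_a", "toy_a", "toy_b", "toy_b"]

clf = NaiveBayesDatasetClassifier().fit(train_sequences, train_labels)
prediction_a = clf.predict("Click here for deals!")
prediction_b = clf.predict("The study of protein results")
```

In this toy setting the two sources differ only in vocabulary, so the classifier separates them easily. The paper's point is that a comparable signal remains detectable even among datasets built with similar curation steps.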
Taxonomy
Topics: Natural Language Processing Techniques
