Scale Dependent Data Duplication
Joshua Kazdan, Noam Levi, Rylan Schaeffer, Jessica Chudnovsky, Abhay Puri, Bo He, Mehmet Donmez, Sanmi Koyejo, David Donoho

TL;DR
This paper investigates how data duplication, especially semantic duplicates, affects model training at different scales, revealing scale-dependent behaviors and providing scaling laws to predict impacts on large models.
Contribution
It introduces the concept of scale-dependent data duplication effects, demonstrates how semantic duplicates influence training as models grow, and derives scaling laws for better data curation at scale.
Findings
Gradient alignment increases with model capability for semantic duplicates.
Nearest-neighbor similarities deviate from baseline at large corpus sizes.
Limited data uniqueness causes significant loss in large model training.
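As a rough illustration of the nearest-neighbor finding above, the sketch below computes, for each embedding in a small set, the highest cosine similarity to any other embedding. It is a brute-force toy in pure Python, not the paper's pipeline. The function names and the `toy_embeddings` values are hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor_similarity(embeddings):
    """For each vector, return its highest cosine similarity to any other vector."""
    result = []
    for i, u in enumerate(embeddings):
        best = max(cosine(u, v) for j, v in enumerate(embeddings) if j != i)
        result.append(best)
    return result

# Hypothetical toy embeddings: the first two are near-duplicates,
# the last two are orthogonal to the first.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
nn_sims = nearest_neighbor_similarity(toy_embeddings)
```

A brute-force scan like this costs quadratic time in the number of documents, so at corpus scale an approximate nearest-neighbor index would presumably be needed instead.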
Abstract
Data duplication during pretraining can degrade generalization and lead to memorization, motivating aggressive deduplication pipelines. However, at web scale, it is unclear what constitutes a "duplicate": beyond surface-form matches, semantically equivalent documents (e.g., translations) may induce redundant training signals once models become sufficiently capable. Practically, this means that semantic duplicates operate increasingly like exact duplicates during training. We present evidence that duplication is scale-dependent in two ways. First, as model capability increases, cross-entropy loss gradients for semantically equivalent documents become more aligned. Smaller models, by contrast, produce gradients that reflect surface similarity (e.g., shared tokens) rather than semantic similarity. Second, we embedded all 192 million FineWeb-Edu-Dedup documents using EmbeddingGemma-300m.…
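One way to make "gradient alignment" concrete is the cosine similarity between the cross-entropy loss gradients of two examples. The sketch below does this for a toy linear softmax classifier, which is an assumption made for brevity and not the language-model setup the abstract describes. All names (`ce_gradient`, `x_a`, `x_b`, `x_c`) and values are hypothetical.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def ce_gradient(W, x, target):
    """Flattened gradient of cross-entropy loss w.r.t. W, where logits = W x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    probs = softmax(logits)
    grad = []
    for k, p_k in enumerate(probs):
        err = p_k - (1.0 if k == target else 0.0)  # dL/dlogit_k
        grad.extend(err * xi for xi in x)          # dL/dW[k][i] = err * x[i]
    return grad

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy model (assumption): 2 classes, 2 features, zero-initialized weights.
W = [[0.0, 0.0], [0.0, 0.0]]
x_a, x_b, x_c = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]

g_a = ce_gradient(W, x_a, target=0)
g_b = ce_gradient(W, x_b, target=0)  # near-duplicate of x_a, same label
g_c = ce_gradient(W, x_c, target=1)  # unrelated input, different label

aligned = cosine(g_a, g_b)    # close to 1: redundant training signal
unrelated = cosine(g_a, g_c)  # 0 here: the gradients touch disjoint weights
```

In this toy, two inputs that are close in feature space yield nearly parallel gradients, which is the sense in which near-duplicates provide a redundant training signal.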
Taxonomy
Topics: Data Quality and Management · Authorship Attribution and Profiling · Benford’s Law and Fraud Detection
