Scale Dependent Data Duplication

Joshua Kazdan; Noam Levi; Rylan Schaeffer; Jessica Chudnovsky; Abhay Puri; Bo He; Mehmet Donmez; Sanmi Koyejo; David Donoho

arXiv:2603.06603·cs.LG·March 10, 2026



TL;DR

This paper investigates how data duplication, especially semantic duplicates, affects model training at different scales, revealing scale-dependent behaviors and providing scaling laws to predict impacts on large models.

Contribution

It introduces the concept of scale-dependent data duplication effects, demonstrates how semantic duplicates influence training as models grow, and derives scaling laws for better data curation at scale.

Findings

01. Gradient alignment increases with model capability for semantic duplicates.
02. Nearest-neighbor similarities deviate from baseline at large corpus sizes.
03. Limited data uniqueness causes significant loss in large model training.
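The nearest-neighbor finding can be made concrete with a small sketch. It is not the paper's pipeline: random Gaussian vectors stand in for real document embeddings, the helper names `nearest_neighbor_similarity` and `mean_nn_similarity_by_size` are hypothetical, and the brute-force O(n^2) similarity matrix would not scale to a web-sized corpus. It only shows the quantity being tracked, each document's cosine similarity to its closest other document, measured on corpora of increasing size.

```python
import numpy as np


def nearest_neighbor_similarity(emb):
    """For each row of emb, cosine similarity to its most similar *other* row."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Exclude each document's similarity to itself.
    np.fill_diagonal(sims, -np.inf)
    return sims.max(axis=1)


def mean_nn_similarity_by_size(emb, sizes):
    """Mean nearest-neighbor similarity over the first n rows, for each n in sizes."""
    return {n: float(nearest_neighbor_similarity(emb[:n]).mean()) for n in sizes}


# Placeholder data: random vectors, NOT real document embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(2000, 32))
curve = mean_nn_similarity_by_size(corpus, sizes=(100, 500, 2000))

# An exact duplicate pair has nearest-neighbor similarity 1 (up to rounding).
with_dup = np.vstack([corpus[:50], corpus[:1]])
dup_sims = nearest_neighbor_similarity(with_dup)
```

With real embeddings, one would compare a curve like `curve` against a reference curve (for example, one computed on data with no semantic structure) to see where the two part ways as the corpus grows.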

Abstract

Data duplication during pretraining can degrade generalization and lead to memorization, motivating aggressive deduplication pipelines. However, at web scale, it is unclear what constitutes a "duplicate": beyond surface-form matches, semantically equivalent documents (e.g., translations) may induce redundant training signals once models become sufficiently capable. Practically, this means that semantic duplicates operate increasingly like exact duplicates during training. We present evidence that duplication is scale-dependent in two ways. First, as model capability increases, cross-entropy loss gradients for semantically equivalent documents become more aligned. Smaller models, by contrast, produce gradients that reflect surface similarity (e.g., shared tokens) rather than semantic similarity. Second, we embedded all 192 million FineWeb-Edu-Dedup documents using EmbeddingGemma-300m.…
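Gradient alignment, the first effect named in the abstract, is the cosine similarity between the loss gradients that two documents produce. The sketch below is a toy illustration under loud assumptions: a tiny linear softmax classifier stands in for a language model, a slightly perturbed copy of a batch stands in for a semantic duplicate, and the function names are hypothetical. It is not the paper's experimental setup.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ce_gradient(W, X, y):
    """Flattened gradient of mean cross-entropy w.r.t. W for logits = X @ W.

    X: (n, d) features, y: (n,) integer labels, W: (d, k) weights.
    """
    probs = softmax(X @ W)
    probs[np.arange(len(y)), y] -= 1.0  # softmax output minus one-hot target
    return (X.T @ probs / len(y)).ravel()


def cosine(a, b):
    """Cosine similarity between two flat gradient vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))

# "Document" A, a near-copy B (toy stand-in for a semantic duplicate),
# and an unrelated document C. All data here is synthetic.
X_a = rng.normal(size=(4, 8))
y_a = rng.integers(0, 5, size=4)
X_b = X_a + 1e-3 * rng.normal(size=X_a.shape)
X_c = rng.normal(size=(4, 8))
y_c = rng.integers(0, 5, size=4)

g_a = ce_gradient(W, X_a, y_a)
align_dup = cosine(g_a, ce_gradient(W, X_b, y_a))
align_other = cosine(g_a, ce_gradient(W, X_c, y_c))
```

A value of `align_dup` near 1 means both documents push the weights in almost the same direction, so the second one adds little new training signal. The abstract's claim is that, for semantically equivalent documents, this similarity rises as model capability increases.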


Taxonomy

Topics: Data Quality and Management · Authorship Attribution and Profiling · Benford’s Law and Fraud Detection