Why Less is More (Sometimes): A Theory of Data Curation
Elvis Dohmatob, Mohammad Pezeshki, Reyhane Askari-Hemmat

TL;DR
This paper develops a theoretical framework explaining when and why using smaller, curated datasets can outperform larger ones in machine learning, supported by empirical validation on ImageNet and insights into LLM strategies.
Contribution
It introduces a novel theory of data curation that predicts when selective data use improves generalization, challenging classical scaling laws.
Findings
Small curated datasets can outperform full datasets under certain conditions.
Exact phase transition curves relate data quality and size to performance.
Empirical validation confirms theoretical predictions on ImageNet.
Abstract
This paper introduces a theoretical framework to resolve a central paradox in modern machine learning: When is it better to use less data? This question has become critical as classical scaling laws suggesting "more is more" (Sun et al., 2025) are challenged by methods like LIMO ("less is more") and s1 (Ye et al., 2025; Muennighoff et al., 2025), which achieve superior performance with small, aggressively curated datasets. Here, we study data curation strategies where an imperfect oracle selects the training examples according to their difficulty and correctness. Our results provide exact scaling law curves for test error under both label-agnostic and label-aware curation rules, revealing when and why keeping only a subset of data can improve generalization. In contrast to classical scaling laws, we show that under certain conditions, small curated datasets can outperform full…
Peer Reviews
Decision·ICLR 2026 Poster
- **Theoretical Rigor and Exactness:** The primary strength is the sophisticated mathematical analysis. By leveraging RMT, the paper derives *exact* asymptotic formulas for the generalization error (Theorems 1, 3). This allows for a precise characterization of the interplay between data quality ($\rho$), oracle quality ($\rho_*$), and data scale ($\phi$), moving far beyond bounds or heuristics. The derivation of how pruning "deforms" the Marchenko-Pastur law governing the data spectrum is technic
- **Linear Models:** The theoretical results are derived under the assumption of linear models and Gaussian covariates. This is a standard simplification for RMT analysis, but it limits the direct quantitative applicability to deep neural networks on structured data. The empirical results do point towards potential qualitative transfer, but this would warrant a more complete exploration.
- **Fixed Oracle:** The framework assumes a fixed pruning oracle ($w_o$). In practice, oracles (e.g., rew
1. The authors use their theory to show that it can predict empirical results to a surprising degree (mean relative error of 1.8% according to App. B), and within these empirical results live the settings they sought to understand: when all data is needed ("more is more") and when curation is needed ("less is more").
2. The authors consider both label-agnostic and label-aware data curation in their theory.
3. Significant extra detail for figures and proofs is provided in the appendix, answering a few of
1. Arguably the most important feature of the paper is that its theory is predictive of empirical results on a very realistic dataset (ImageNet training of a ViT), but it is not clear from the writing how to use this in practice. In particular: how does someone measure the quality of the generator $\rho$ for a given dataset? No commentary on this is given. A bonus would be if the authors gave a step-by-step appendix section for a practitioner on how to use their theory (willing to rais
* The paper is clearly written and well-organized. It has both a solid theoretical side and empirical evidence.
* The paper has deep theoretical building blocks, with high-dimensional asymptotics of linear models and classification problems. It uses both RMT techniques and tools from Feng et al. for classification problems to study the effects of pruning by the oracle. The conclusions are sound as far as I have checked, and the results appear novel to me.
* The paper's conclusion provides an a
Major comments:
* I think the main inconsistency between the main message and the theory is that both the "KE" and "KH" strategies are effectively "less is more". The only difference is which data should be picked. I think it is a bit of a stretch to say "KE" is equivalent to "more is more", while the neural scaling laws scenario should correspond to a very small $\phi$ instead of the "KE" strategy.
* The theory relies on a linear model with Gaussian covariates, and is limited to the binary classification setup with p
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification
