Data Curation Through the Lens of Spectral Dynamics: Static Limits, Dynamic Acceleration, and Practical Oracles
Yizhou Zhang, Lun Du

TL;DR
This paper analyzes how different data curation strategies affect neural model training by examining their spectral properties, revealing limitations of static pruning and potential benefits of dynamic, oracle-guided approaches for accelerating learning.
Contribution
The paper formalizes data curation as a reweighting of the sampling distribution and studies its effect on the spectrum of the data-induced operator. It proves that static pruning cannot alter asymptotic scaling and shows that dynamic reweighting can, in theory, accelerate training.
Findings
Static pruning cannot change the spectral tail exponent.
Dynamic, oracle-based reweighting can accelerate learning.
Practical systems can only approximate ideal spectral tracking.
Abstract
Large-scale neural models are increasingly trained with data pruning, synthetic data generation, cross-model distillation, reinforcement learning from human feedback (RLHF), and difficulty-based sampling. While several of these data-centric strategies reliably improve training efficiency and downstream performance, others fail to provide meaningful gains, most notably self-generated synthetic data, which often increases dataset volume without enhancing model capability. We formalize data curation as reweighting the sampling distribution and map its effect onto the eigenstructure of the data-induced operator. Our first main result shows that static pruning induces a bounded operator and therefore cannot change the spectral tail exponent; it provides at most finite-region improvements and cannot alter asymptotic neural scaling. Our second result analyzes…
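The first result can be illustrated with a small numerical sketch. It is a toy model and not the paper's construction: the data-induced operator is assumed to be diagonal with a power-law spectrum i^(-alpha), static curation is assumed to multiply each eigenvalue by a fixed weight drawn from a bounded interval, and the names fit_tail_exponent, w_static, and beta are hypothetical. For contrast, the last case applies an unbounded weight i^beta, which is outside what a bounded operator can do and is not meant to represent the paper's dynamic scheme.

```python
import numpy as np

def fit_tail_exponent(eigs, lo=1_000, hi=10_000):
    """Estimate a in eig_k ~ k^(-a) with a log-log least-squares fit
    over sorted ranks lo..hi (a window well inside the truncated spectrum)."""
    s = np.sort(eigs)[::-1]                      # eigenvalues in descending order
    k = np.arange(lo, hi + 1)
    slope, _ = np.polyfit(np.log(k), np.log(s[lo - 1:hi]), 1)
    return -slope

rng = np.random.default_rng(0)
n, alpha = 100_000, 1.5                          # assumed toy values
idx = np.arange(1, n + 1, dtype=float)
base = idx ** -alpha                             # assumed power-law spectrum

# Static curation (toy): fixed weights bounded in [0.5, 2.0].
w_static = rng.uniform(0.5, 2.0, size=n)
static = w_static * base

# Contrast case (toy): unbounded weights i^beta shift the exponent to alpha - beta.
beta = 0.5
unbounded = idx ** beta * base

a_base = fit_tail_exponent(base)
a_static = fit_tail_exponent(static)
a_unbounded = fit_tail_exponent(unbounded)
print(f"base: {a_base:.3f}  static: {a_static:.3f}  unbounded: {a_unbounded:.3f}")
```

In this toy setting the bounded weights rescale the spectrum by at most a constant factor, so the fitted exponent stays close to alpha, while only the unbounded weights move it.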
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis
