How Much Data is Enough? The Zeta Law of Discoverability in Biomedical Data, featuring the enigmatic Riemann zeta function
Paul M. Thompson

TL;DR
This paper introduces a spectral scaling-law framework based on the Riemann zeta function to predict data sufficiency and performance improvements in biomedical data analysis and AI models.
Contribution
It proposes a novel theoretical model linking spectral properties of data to performance scaling, guiding data collection and model development strategies.
Findings
Spectral decay and signal alignment follow a zeta-like power-law scaling.
Representation learning enhances sample efficiency by concentrating signals in stable spectral modes.
The framework predicts, as a function of dataset size, when simpler models outperform more complex ones and when the ordering reverses.
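To make the "zeta-like power-law scaling" in the findings above concrete, here is a minimal sketch, not the paper's formula. It assumes, purely for illustration, that spectral mode k carries signal energy proportional to k^(-s) with s > 1, so the total energy is the Riemann zeta value ζ(s) and the energy in the first K modes is a partial sum. The function names (`zeta_approx`, `captured_fraction`, `modes_needed`) and the decay exponent are hypothetical choices, not taken from the paper.

```python
import math


def zeta_approx(s: float, n_terms: int = 1000) -> float:
    """Approximate zeta(s) for s > 1.

    Uses a direct partial sum plus an Euler-Maclaurin tail correction:
    sum_{k>N} k^-s ~ N^(1-s)/(s-1) - N^-s / 2.
    """
    if s <= 1.0:
        raise ValueError("the series only converges for s > 1")
    head = sum(k ** (-s) for k in range(1, n_terms + 1))
    tail = n_terms ** (1.0 - s) / (s - 1.0) - 0.5 * n_terms ** (-s)
    return head + tail


def captured_fraction(s: float, k_modes: int) -> float:
    """Fraction of total energy held by the first k_modes modes.

    Assumption (illustrative only): mode k has energy k^-s.
    """
    partial = sum(k ** (-s) for k in range(1, k_modes + 1))
    return partial / zeta_approx(s)


def modes_needed(s: float, target: float, max_modes: int = 1_000_000) -> int:
    """Smallest K whose first K modes capture at least `target` of the energy."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    goal = target * zeta_approx(s)
    partial = 0.0
    for k in range(1, max_modes + 1):
        partial += k ** (-s)
        if partial >= goal:
            return k
    raise RuntimeError("target not reached within max_modes")


if __name__ == "__main__":
    for s in (1.5, 2.0, 3.0):
        print(s, modes_needed(s, 0.9), round(captured_fraction(s, 10), 4))
```

Under this toy assumption, a faster decay (larger s) concentrates energy in fewer modes, so fewer modes need to be resolved to capture a fixed share of the signal. Slow decay close to s = 1 leaves a long tail, where each additional mode contributes little.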
Abstract
How much data is enough to make a scientific discovery? As biomedical datasets scale to millions of samples and AI models grow in capacity, progress increasingly depends on predicting when additional data will substantially improve performance. In practice, model development often relies on empirical scaling curves measured across architectures, modalities, and dataset sizes, with limited theoretical guidance on when performance should improve, saturate, or exhibit cross-over behavior. We propose a scaling-law framework for cross-modal discoverability based on spectral structure of data covariance operators, task-aligned signal projections, and learned representations. Many performance metrics, including AUC, can be expressed in terms of cumulative signal-to-noise energy accumulated across identifiable spectral modes of an encoder and cross-modal operator. Under mild assumptions, this…
