An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws
Hong Jun Jeon, Benjamin Van Roy

TL;DR
This paper develops an information-theoretic framework to analyze and derive compute-optimal neural scaling laws, revealing that optimal resource allocation shifts towards larger models as input complexity increases.
Contribution
It introduces a mathematical theory for compute-optimal neural scaling based on simplified models and error bounds, supported by empirical validation.
Findings
Linear compute-optimal scaling law identified
Optimal resource allocation favors larger models with increased input complexity
Provides new insights into model and data size trade-offs
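To make the linear finding concrete, here is a minimal sketch of how a linear relation between model size and data set size would split a fixed compute budget. It is not taken from the paper: the cost model (compute proportional to parameters times training examples), the function name `compute_optimal_split`, and the constants `data_per_param` and `cost_per_param_example` are illustrative assumptions only.

```python
import math


def compute_optimal_split(compute_budget, data_per_param=20.0, cost_per_param_example=6.0):
    """Split a compute budget between model size and data set size.

    Assumptions (hypothetical, for illustration only):
      * compute = cost_per_param_example * n_params * n_data
      * the optimal allocation is linear: n_data = data_per_param * n_params
    """
    # Substituting the linear law into the cost model gives
    # compute = cost_per_param_example * data_per_param * n_params**2.
    n_params = math.sqrt(compute_budget / (cost_per_param_example * data_per_param))
    n_data = data_per_param * n_params
    return n_params, n_data


if __name__ == "__main__":
    for budget in (1e18, 1e20, 1e22):
        n_params, n_data = compute_optimal_split(budget)
        print(f"budget={budget:.0e}  params={n_params:.3e}  data={n_data:.3e}")
```

Under these assumptions both quantities grow as the square root of the budget, so their ratio stays fixed as compute increases; that constant ratio is what "linear" means here.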
Abstract
We study the compute-optimal trade-off between model and training data set sizes for large neural networks. Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla. While that work studies transformer-based large language models trained on the MassiveText corpus (Gopher), as a starting point for development of a mathematical theory, we focus on a simpler learning model and data generating process, each based on a neural network with a sigmoidal output unit and a single hidden layer of ReLU activation units. We introduce general error upper bounds for a class of algorithms that incrementally update a statistic (for example, gradient descent). For a particular learning model inspired by Barron (1993), we establish an upper bound on the minimal information-theoretically achievable expected error as a function of model and data set sizes. We then…
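The data generating process described in the abstract (a single hidden layer of ReLU units feeding a sigmoidal output unit) can be sketched as follows. This is only a rough illustration: the Gaussian weight and input distributions, the scaling factors, the reading of the sigmoid output as a binary label probability, and all function names are assumptions made here, not details confirmed by the abstract.

```python
import math
import random


def relu(z):
    return max(z, 0.0)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_teacher(input_dim, hidden_width, rng):
    """Draw random weights for a one-hidden-layer network (assumed Gaussian)."""
    hidden_weights = [
        [rng.gauss(0.0, 1.0) / math.sqrt(input_dim) for _ in range(input_dim)]
        for _ in range(hidden_width)
    ]
    output_weights = [
        rng.gauss(0.0, 1.0) / math.sqrt(hidden_width) for _ in range(hidden_width)
    ]
    return hidden_weights, output_weights


def label_probability(x, hidden_weights, output_weights):
    """Forward pass: ReLU hidden layer, then a sigmoidal output unit."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in hidden_weights]
    logit = sum(a * h for a, h in zip(output_weights, hidden))
    return sigmoid(logit)


def sample_dataset(n, input_dim, hidden_width, seed=0):
    """Generate n (input, binary label) pairs from one randomly drawn teacher."""
    rng = random.Random(seed)
    hidden_weights, output_weights = sample_teacher(input_dim, hidden_width, rng)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(input_dim)]  # assumed Gaussian inputs
        p = label_probability(x, hidden_weights, output_weights)
        y = 1 if rng.random() < p else 0  # Bernoulli label with probability p
        data.append((x, y))
    return data
```

In this sketch `hidden_width` plays the role of model size and `n` the role of data set size, the two quantities whose trade-off the abstract studies.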
Taxonomy
Topics: Neural Networks and Applications · Ferroelectric and Negative Capacitance Devices · Model Reduction and Neural Networks
Methods: Chinchilla
