An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws

Hong Jun Jeon; Benjamin Van Roy

arXiv:2212.01365 · cs.LG · October 20, 2023

TL;DR

This paper develops an information-theoretic framework to analyze and derive compute-optimal neural scaling laws, revealing that optimal resource allocation shifts towards larger models as input complexity increases.

Contribution

It introduces a mathematical theory for compute-optimal neural scaling based on simplified models and error bounds, supported by empirical validation.

Findings

01 Linear compute-optimal scaling law identified

02 Optimal resource allocation favors larger models with increased input complexity

03 Provides new insights into model and data size trade-offs

Abstract

We study the compute-optimal trade-off between model and training data set sizes for large neural networks. Our result suggests a linear relation similar to that supported by the empirical analysis of Chinchilla. While that work studies transformer-based large language models trained on the MassiveText corpus (Gopher), as a starting point for development of a mathematical theory, we focus on a simpler learning model and data generating process, each based on a neural network with a sigmoidal output unit and a single hidden layer of ReLU activation units. We introduce general error upper bounds for a class of algorithms that incrementally update a statistic (for example, gradient descent). For a particular learning model inspired by Barron (1993), we establish an upper bound on the minimal information-theoretically achievable expected error as a function of model and data set sizes. We then…
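To make the idea of a linear compute-optimal relation concrete, here is a minimal sketch under loudly stated assumptions. The error form `a / n + b / t`, the constants `a` and `b`, the budget constraint `n * t = compute`, and the function names are all hypothetical illustrations; they are not the bound derived in the paper. Under this toy form, minimizing the error for a fixed budget gives a data set size `t` proportional to the model size `n`.

```python
import math


def toy_error(n, t, a=1.0, b=1.0):
    """Hypothetical error: a term shrinking with model size n plus one shrinking with data size t."""
    return a / n + b / t


def optimal_allocation(compute, a=1.0, b=1.0):
    """Minimize a/n + b/t subject to the assumed budget n * t = compute.

    Substituting t = compute / n gives a/n + b*n/compute, whose derivative
    vanishes at n = sqrt(a * compute / b).
    """
    n = math.sqrt(a * compute / b)
    t = compute / n
    return n, t


# The ratio t / n stays fixed at b / a as the budget grows: a linear relation.
for compute in (1e6, 1e8, 1e10):
    n, t = optimal_allocation(compute, a=2.0, b=8.0)
    print(f"C={compute:.0e}  n={n:.3e}  t={t:.3e}  t/n={t / n:.2f}")
```

In this toy setting the ratio `t / n` equals `b / a` at every budget, so doubling the optimal model size doubles the optimal data set size. A different assumed error form would generally yield a different relation.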

Taxonomy

Topics: Neural Networks and Applications · Ferroelectric and Negative Capacitance Devices · Model Reduction and Neural Networks

Methods: Chinchilla