Problems with Chinchilla Approach 2: Systematic Biases in IsoFLOP Parabola Fits
Eric Czech, Zhiwei Xu, Yael Elmatad, Yixin Wang, William Held

TL;DR
This paper identifies systematic biases in the widely used Chinchilla Approach 2 for fitting neural scaling laws, proposes improvements, and demonstrates more accurate, unbiased inference methods with practical advantages.
Contribution
It introduces Variable Projection to address biases in Approach 2, offering a more stable, unbiased, and scalable method for neural scaling law fitting.
Findings
Chinchilla Approach 2 introduces systematic biases into compute-optimal allocation estimates, leading to parameter underallocation and unnecessary compute spend.
Approach 3 reduces biases but has perceived drawbacks that are addressable.
Variable Projection enables unbiased, well-conditioned inference on all loss surface parameters.
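The idea behind Variable Projection can be sketched in a few lines. The sketch below assumes the standard Chinchilla loss surface L(N, D) = E + A/N^alpha + B/D^beta, which this summary does not spell out. For fixed exponents (alpha, beta), the parameters E, A, and B enter linearly and can be solved by ordinary least squares, so only the two exponents remain for the outer search. The function names, the synthetic constants, and the plain grid search used for the outer step are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fit_linear_given_exponents(n, d, y, alpha, beta):
    """For fixed (alpha, beta), solve for (E, A, B) by ordinary least squares."""
    X = np.column_stack([np.ones_like(n), n**-alpha, d**-beta])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((X @ coef - y) ** 2))  # residual sum of squares
    return coef, rss

def varpro_grid_fit(n, d, y, alphas, betas):
    """Search only over the exponents; the linear parameters are projected out.

    A grid search stands in for the outer optimizer (an assumption made for
    simplicity; any 2-D optimizer could be used instead).
    """
    best = None
    for alpha in alphas:
        for beta in betas:
            coef, rss = fit_linear_given_exponents(n, d, y, alpha, beta)
            if best is None or rss < best[0]:
                best = (rss, alpha, beta, coef)
    rss, alpha, beta, (e, a, b) = best
    return {"E": float(e), "A": float(a), "B": float(b),
            "alpha": float(alpha), "beta": float(beta), "rss": rss}

# Noise-free synthetic data from assumed (hypothetical) ground-truth constants.
TRUE = {"E": 1.7, "A": 400.0, "B": 410.0, "alpha": 0.34, "beta": 0.28}
n_grid, d_grid = np.meshgrid(np.logspace(7, 10, 7), np.logspace(9, 12, 7))
n, d = n_grid.ravel(), d_grid.ravel()
y = TRUE["E"] + TRUE["A"] / n**TRUE["alpha"] + TRUE["B"] / d**TRUE["beta"]

exponent_grid = np.linspace(0.20, 0.50, 31)  # step 0.01, contains 0.34 and 0.28
fit = varpro_grid_fit(n, d, y, exponent_grid, exponent_grid)
print(fit)
```

Because the inner solve is a linear least-squares problem, the nonlinear search is two-dimensional rather than five-dimensional, which is one plausible reading of why the method is described as well-conditioned.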
Abstract
Chinchilla Approach 2 is among the most widely used methods for fitting neural scaling laws. Its parabolic approximation introduces systematic biases in compute-optimal allocation estimates, even on noise-free synthetic data. Applied to published Llama 3 IsoFLOP data at open frontier compute scales, these biases imply a parameter underallocation corresponding to 6.5% of the FLOP training budget and $1.4M (90% CI: $412K-$2.9M) in unnecessary compute at 50% H100 MFU. Simulated multimodal model misallocations show even greater opportunity costs due to higher loss surface asymmetry. Three sources of this error are examined: IsoFLOP sampling grid width (Taylor approximation accuracy), uncentered IsoFLOP sampling, and loss surface asymmetry. Chinchilla Approach 3 largely eliminates these biases but is often regarded as less data-efficient,…
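The three error sources named in the abstract can be reproduced on noise-free synthetic data with a short script. The sketch below assumes the standard Chinchilla loss surface L(N, D) = E + A/N^alpha + B/D^beta and the common approximation C = 6ND; neither is stated in this summary. It fits a parabola to loss versus log10(N) along one IsoFLOP curve and compares the parabola's vertex with the exact optimum. All constants, function names, and grid settings are hypothetical choices for illustration, not values from the paper.

```python
import numpy as np

# Illustrative (assumed) loss-surface constants; not taken from the paper.
E, A, B = 1.7, 400.0, 410.0

def loss(n, d, alpha, beta):
    """Chinchilla-style surface L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n**alpha + B / d**beta

def true_opt_n(c, alpha, beta):
    """Exact minimizer over N along the IsoFLOP curve C = 6*N*D.

    Setting d/dN [A N^-alpha + B (C/(6N))^-beta] = 0 gives
    N^(alpha+beta) = (alpha*A / (beta*B)) * (C/6)^beta.
    """
    return (alpha * A / (beta * B) * (c / 6.0) ** beta) ** (1.0 / (alpha + beta))

def parabola_opt_n(c, alpha, beta, half_width=1.0, offset=0.0, n_points=9):
    """Approach-2-style estimate: vertex of a parabola fit to loss vs. log10(N).

    half_width and offset are in decades of N; offset shifts the sampling
    grid's center away from the true optimum.
    """
    center = np.log10(true_opt_n(c, alpha, beta)) + offset
    u = np.linspace(-half_width, half_width, n_points)  # grid relative to center
    n = 10.0 ** (center + u)
    d = c / (6.0 * n)
    c2, c1, _ = np.polyfit(u, loss(n, d, alpha, beta), 2)
    return 10.0 ** (center - c1 / (2.0 * c2))

def log_bias(c, alpha, beta, **grid_kwargs):
    """Natural-log ratio of the parabola estimate to the exact optimum."""
    est = parabola_opt_n(c, alpha, beta, **grid_kwargs)
    return float(np.log(est / true_opt_n(c, alpha, beta)))

C = 1e21  # hypothetical compute budget in FLOPs
for width in (0.1, 0.5, 1.0):
    print("asymmetric, half_width =", width, "->", log_bias(C, 0.34, 0.28, half_width=width))
print("symmetric, centered    ->", log_bias(C, 0.30, 0.30))
print("symmetric, off-center  ->", log_bias(C, 0.30, 0.30, offset=0.5))
```

In this toy setup the estimate matches the exact optimum only when the surface is symmetric (alpha equal to beta) and the grid is centered on the optimum. Widening the grid on an asymmetric surface, or shifting the grid off-center, moves the parabola's vertex away from the true minimum even though the data contain no noise.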
