Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Rylan Schaeffer; Hailey Schoelkopf; Brando Miranda; Gabriel Mukobi; Varun Madan; Adam Ibrahim; Herbie Bradley; Stella Biderman; Sanmi Koyejo

arXiv:2406.04391 · cs.LG · February 7, 2025 · 2 cites

Open Access

TL;DR

This paper investigates why predicting the scaling behavior of downstream capabilities in advanced AI models remains difficult, identifying key factors affecting predictability and proposing directions for more reliable evaluation methods.

Contribution

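It shows that the sequence of transformations used to compute downstream performance metrics from negative log likelihoods progressively degrades the statistical relationship between performance and scale, and suggests that scaling laws for incorrect choices could improve predictability.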

Findings

01

Downstream performance metrics depend on how probability mass fluctuates across the answer choices, not only on the probability of the correct choice.

02

Scaling laws for incorrect choices may be more predictable.

03

Downstream evaluation predictability can be improved by understanding probability mass fluctuations.

Abstract

Predicting changes from scaling advanced AI systems is a desirable property for engineers, economists, governments and industry alike, and, while a well-established literature exists on how pretraining performance scales, predictable scaling behavior on downstream capabilities remains elusive. While many factors are certainly responsible, this paper identifies a significant factor that makes predicting scaling behavior on widely used multiple-choice question answering benchmarks challenging and illuminates a path towards making such downstream evaluations predictable with scale. Using five model families and twelve well-established multiple-choice benchmarks, we demonstrate that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrades the statistical relationship between performance and scale. We then pinpoint the…
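The chain of transformations the abstract refers to can be illustrated with a small sketch. This is not the authors' code: the function name `choice_metrics`, the dictionary keys, and the example negative log likelihoods are all hypothetical, and the three steps (exponentiate, renormalize over the listed choices, take the argmax) are an assumed, commonly used way of scoring multiple-choice questions.

```python
import math


def choice_metrics(nll_per_choice, correct_idx):
    """Map per-choice negative log likelihoods to three downstream quantities.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    # Step 1: probability mass on each choice, measured over the full vocabulary.
    p_vocab = [math.exp(-nll) for nll in nll_per_choice]

    # Step 2: renormalize so the mass over the listed choices sums to 1.
    total = sum(p_vocab)
    p_choices = [p / total for p in p_vocab]

    # Step 3: accuracy is 1 only when the correct choice has the highest mass.
    predicted = max(range(len(p_choices)), key=p_choices.__getitem__)
    accuracy = 1.0 if predicted == correct_idx else 0.0

    return {
        "p_vocab_correct": p_vocab[correct_idx],
        "p_choices_correct": p_choices[correct_idx],
        "accuracy": accuracy,
    }


# Made-up NLLs for one four-choice question; index 0 is the correct answer.
# In the second case the correct choice's NLL improves (2.0 -> 1.5), but an
# incorrect choice improves even more (2.5 -> 1.2).
before = choice_metrics([2.0, 2.5, 3.0, 3.5], correct_idx=0)
after = choice_metrics([1.5, 1.2, 3.0, 3.5], correct_idx=0)

print(before)
print(after)
```

In this made-up example the probability of the correct choice rises between the two cases while accuracy falls from 1 to 0, because accuracy also depends on the mass assigned to the incorrect choices. Each step after the first discards or mixes in information, which is one way to read the claim that the relationship with scale degrades along the chain.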

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)