Predicting Emergent Abilities with Infinite Resolution Evaluation
Shengding Hu, Xin Liu, Xu Han, Xinrong Zhang, Chaoqun He, Weilin Zhao, Yankai Lin, Ning Ding, Zebin Ou, Guoyang Zeng, Zhiyuan Liu, Maosong Sun

TL;DR
This paper introduces PassUntil, an evaluation method with theoretically infinite resolution that can measure even slight performance improvements in small models, revealing a strict task scaling law and an accelerated form of emergence.
Contribution
The study presents PassUntil for high-resolution evaluation, discovers a predictable task scaling law, and quantitatively analyzes emergent abilities in large language models.
Findings
PassUntil achieves near-infinite measurement resolution.
A strict task scaling law accurately predicts performance.
The analysis identifies an accelerated emergence phenomenon.
Abstract
The scientific scale-up of large language models (LLMs) necessitates a comprehensive understanding of their scaling properties. However, the existing literature on the scaling properties only yields an incomplete answer: optimization loss decreases predictably as the model size increases, in line with established scaling law; yet no scaling law for task has been established and the task performances are far from predictable during scaling. Task performances typically show minor gains on small models until they improve dramatically once models exceed a size threshold, exemplifying the "emergent abilities". In this study, we discover that small models, although they exhibit minor performance, demonstrate critical and consistent task performance improvements that are not captured by conventional evaluation strategies due to insufficient measurement resolution. To measure such…
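The resolution argument in the abstract can be illustrated with a small sketch. This summary does not spell out the estimator, so the code assumes a sample-until-pass scheme: keep decoding answers for one instance until `r` of them are correct, and estimate the pass probability as `r / K`, where `K` is the number of samples drawn. The names `pass_until`, `sample_passes`, `max_samples`, and `single_sample_accuracy` are hypothetical, and the pass probability in the demo is made up for illustration.

```python
import random
from typing import Callable


def pass_until(sample_passes: Callable[[], bool], r: int = 1,
               max_samples: int = 100_000) -> float:
    """Sample until `r` generations pass; return the estimate r / K.

    `sample_passes` is a hypothetical callback that decodes one answer
    and reports whether it is correct. `max_samples` is an assumed
    safety budget, not a value taken from the paper.
    """
    passes = 0
    for k in range(1, max_samples + 1):
        if sample_passes():
            passes += 1
            if passes == r:
                return r / k  # K = k samples were needed to reach r passes
    # Budget exhausted before r passes: fall back to the plain pass rate.
    return passes / max_samples


def single_sample_accuracy(outcomes: list[bool]) -> float:
    """Conventional metric: one attempt per instance, so scores move in
    steps of 1 / len(outcomes)."""
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    rng = random.Random(0)
    true_p = 0.002  # made-up pass probability of a small model on one instance
    estimate = pass_until(lambda: rng.random() < true_p, r=5)
    print(f"sample-until-pass estimate: {estimate:.5f}")
    # With 100 instances and one attempt each, the score is a multiple of 0.01,
    # so a pass probability this small is usually reported as 0.0.
    one_shot = [rng.random() < true_p for _ in range(100)]
    print(f"single-sample accuracy: {single_sample_accuracy(one_shot):.2f}")
```

Because the sample count adapts to the difficulty of the instance, the smallest measurable pass rate is bounded only by the sampling budget rather than by the test-set size. Note that `r / K` is somewhat biased upward for small `r`, so a larger `r` or averaging over many instances is advisable in this sketch.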
Peer Reviews
Decision: ICLR 2024 poster
1. This paper presents a pioneering open-source effort to predict task performance in large language models, aiming to significantly contribute to and encourage future research in this area. 2. The authors introduce PASSUNTIL, a novel evaluation strategy that appears to provide a more equitable assessment compared to existing metrics, showcasing their innovative approach to addressing the challenges in the field. 3. The proposed scaling laws demonstrate a strong fit with the data across severa
1. The derivation of Equation (3) appears to have some discrepancies. (Please refer to Question 1) 2. The number of evaluation datasets used in this study is somewhat limited in comparison to previous works on scaling laws. (Please see Question 2) 3. The task scaling law seems somewhat arbitrary, as some tasks require standard scaling laws while others necessitate super scaling laws. (Please refer to Question 3) Overall, this paper serves as a valuable starting point, but it lacks a comprehen
Overall, I think this is a good paper. I think it is well motivated, thorough and insightful. I would be happy to increase my score if a number of modifications are made (or if the authors tell me why I'm mistaken!).
Here, I give my feedback sequentially, moving through the paper from top to bottom. Regarding the passage "We hypothesize that the perceived discontinuity from trivial to excellent performance might stem from limited evaluation resolution. By employing a more nuanced resolution, one could potentially uncover the scaling law for tasks. Our hypothesis diverges significantly from that of Schaeffer et al. (2023),": I think this misstates Schaeffer et al. (2023). Their abstract states, “we provide evidence tha
1. Proposed the evaluation strategy "PASSUNTIL" with theoretically infinite resolution, enabling the prediction of task performance and the derivation of the task scaling law. 2. Analyzed emergent abilities using a mathematical definition, challenging prevailing hypotheses and introducing an alternative circuit hypothesis based on theoretical derivations. 3. Conducted experiments to validate the theoretical analysis and provided the first open-source attempt to investigate the predictability o
1. Motivation vs. actual design of the task scaling law: The paper states that “Despite the predictable decrement in LLM loss, task performance improvements are twisted during scaling” and "First, these works concentrate on training and validation loss metrics, which do not reliably predict task performance." This suggests that the loss metrics may not be highly dependable for predicting task performance. However, the task scaling law design still focuses on the correlation between PU and test
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Software Engineering Research
