Can Language Models Discover Scaling Laws?
Haowei Lin, Haotian Ye, Wenzheng Feng, Quzhe Huang, Yujun Li, Hubert Lim, Zhengrui Li, Xiangyu Wang, Jianzhu Ma, Yitao Liang, James Zou

TL;DR
This paper introduces SLDAgent, an evolution-based AI system that autonomously discovers more accurate scaling laws for language models, surpassing human-derived laws in predictive performance across diverse tasks.
Contribution
The paper presents SLDAgent, a novel agent that co-optimizes the scaling law's functional form and its fitted parameters to autonomously discover superior scaling laws, advancing automated scientific discovery in AI.
Findings
SLDAgent outperforms human-derived laws in extrapolation accuracy
Discovered laws are practically useful in pretraining and finetuning
The approach demonstrates AI's capability for autonomous scientific discovery
Abstract
Discovering scaling laws for predicting model performance at scale is a fundamental and open-ended challenge, mostly reliant on slow, case-specific human experimentation. To investigate the potential for LLMs to automate this process, we collect over 5,000 experiments from existing literature and curate eight diverse scaling law discovery tasks. While existing agents struggle to produce accurate law formulas, this paper introduces SLDAgent, an evolution-based agent that co-optimizes the scaling law model and the parameters, enabling it to autonomously explore complex relationships between variables. For the first time, we demonstrate that SLDAgent can automatically discover laws that exhibit consistently more accurate extrapolation than their established, human-derived counterparts across all tasks. Through comprehensive analysis, we elucidate why these discovered laws are superior and…
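The evaluation idea in the abstract (fit a candidate law on smaller-scale runs, then score it on held-out larger-scale runs) can be illustrated with a minimal sketch. This is not SLDAgent or the authors' code. The two candidate forms (a pure power law and a power law with an irreducible-loss offset), the synthetic data, and all function names are assumptions chosen for illustration, and the grid search over the offset is a simple stand-in for a real parameter optimizer.

```python
import math

def fit_power_law(ns, losses, offset=0.0):
    """Least-squares fit of log(loss - offset) = log(a) - b * log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l - offset) for l in losses]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(my - slope * mx)
    return a, -slope  # loss ~ offset + a * n ** (-b)

def predict(n, a, b, offset=0.0):
    return offset + a * n ** (-b)

def fit_with_offset(ns, losses, steps=200):
    """Grid-search the offset E in [0, min(losses)); fit a, b for each E."""
    best = None
    lowest = min(losses)
    for i in range(steps):
        e = lowest * i / steps
        a, b = fit_power_law(ns, losses, offset=e)
        sse = sum((predict(n, a, b, e) - l) ** 2 for n, l in zip(ns, losses))
        if best is None or sse < best[0]:
            best = (sse, a, b, e)
    return best[1], best[2], best[3]

def mean_abs_rel_error(ns, losses, a, b, offset=0.0):
    errs = [abs(predict(n, a, b, offset) - l) / l for n, l in zip(ns, losses)]
    return sum(errs) / len(errs)

# Hypothetical synthetic data generated from L = 1.5 + 400 * N ** -0.3.
sizes = [1e6, 3e6, 1e7, 3e7, 1e8, 3e8, 1e9, 3e9]
losses = [1.5 + 400 * n ** -0.3 for n in sizes]

# Hold out the two largest sizes as the extrapolation test set.
train_n, test_n = sizes[:6], sizes[6:]
train_l, test_l = losses[:6], losses[6:]

# Candidate 1: pure power law. Candidate 2: power law plus offset.
a0, b0 = fit_power_law(train_n, train_l)
a1, b1, e1 = fit_with_offset(train_n, train_l)

err_plain = mean_abs_rel_error(test_n, test_l, a0, b0)
err_offset = mean_abs_rel_error(test_n, test_l, a1, b1, e1)
print(f"pure power law: {err_plain:.4f}  with offset: {err_offset:.4f}")
```

Both candidates are fitted only on the six smaller sizes, so the comparison depends on the functional form as well as the fitted parameters. In this toy setup the form that matches the data-generating process extrapolates better.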
Peer Reviews
Decision: ICLR 2026 Poster
- Overall I enjoyed reading the paper
- I noticed that a lot of the implementation detail questions I wrote down during my reading got resolved as I read through it, which suggests that the grounds were well covered.
- The construction of the training-extrapolation set is sensible and reflective of real world scenarios ("For each task, we hold out an extrapolation test set from the dataset by selecting data corresponding to the largest model or dataset sizes.")
- Good controlled comparisons: th
The weaknesses below are not critical:
- Sparse discussion of the possible effects of contamination and measures taken to prevent this
- Coverage of unexpected or harder-to-predict scaling laws like inverse scaling or U-shaped scaling
I provide more context about these points in the question section, but I think neither of these limitations is grounds for rejection.
1. The authors study a new application of LLM agents, namely the discovery of scaling laws for foundation models. This contributes to the growing body of literature on the potential of LLMs to augment scientific discovery.
2. Discovering better scaling laws has the potential to feed directly back to improving LLMs.
3. The paper is well-written and clear (apart from being too broadly scoped in the introduction, as I complain about below).
1. The biggest challenge of scaling laws is actually *formulating the set of independent & dependent variables* that make sense to model. For instance, recent works had the insight to propose new axes of scaling such as the vocabulary size $V$ (Tao et al., 2024), the amount of unique data $U$ (Muennighoff et al., 2024), or the domain mixture (Ye et al., 2024). Therefore, the sense of "scaling law discovery" here is limited — the agent is not asked to discover entirely new axes of scaling, but in
1. Experiments are extensive, evaluating a variety of baselines on many tasks. Experiments with SLDAgent + different models show that SLDAgent performance can scale with model size. Analyses are thorough, and the paper also shows that SLDAgent can be used downstream in real-world settings (hyperparameter selection, LLM selection).
2. Novelty: While existing works have studied AI agents for science, none have been proposed specifically for evaluating scaling laws.
3. The paper is well-written a
1. The comparisons between SLDAgent and other methods are not totally fair.
- 1a. Comparisons between SLDAgent and other baselines do not control for the number of LLM calls. Results for SLDAgent reflect performance after 50 rounds – do these SLDAgent runs then involve many more LLM calls (or other forms of compute) than other agent baselines?
- 1b. It is hard to know how much of the performance of SLDAgent is driven by the LLM versus the optimization and generator subroutines; while
Taxonomy
Topics: Language and cultural evolution · Natural Language Processing Techniques · Topic Modeling
