Scaling Laws for Predicting Downstream Performance in LLMs

Yangyi Chen; Binxuan Huang; Yifan Gao; Zhengyang Wang; Jingfeng Yang; Heng Ji

arXiv:2410.08527·cs.CL·April 9, 2025

Open Access·4 Reviews

TL;DR

This paper introduces a two-stage scaling law approach, FLP, for predicting downstream performance of large language models using pre-training loss, achieving high accuracy with fewer resources and extending to multi-source datasets.

Contribution

The work presents FLP and FLP-M, novel scaling law methods that improve performance prediction accuracy and practicality for large language models across different data sources.

Findings

01

FLP predicts 7B and 13B LLM performance within 5-10% error.

02

FLP-M extends to multi-source datasets, maintaining accuracy within 10%.

03

FLP outperforms the FLOPs-to-Performance baseline in prediction tasks.

Abstract

Precise estimation of downstream performance in large language models (LLMs) prior to training is essential for guiding their development process. Scaling laws analysis utilizes the statistics of a series of significantly smaller sampling language models (LMs) to predict the performance of the target LLM. For downstream performance prediction, the critical challenge lies in the emergent abilities in LLMs that occur beyond task-specific computational thresholds. In this work, we focus on the pre-training loss as a more computation-efficient metric for performance estimation. Our two-stage approach FLP consists of first estimating a function that maps computational resources (e.g., FLOPs) to the pre-training Loss using a series of fully-converged sampling models, followed by mapping the pre-training loss to downstream task Performance using the intermediate models with emerged…
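The two-stage idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the power-law form for FLOPs → Loss, the linear form for Loss → Performance, the function names, and all data points are assumptions made here for demonstration (the data is synthetic and noise-free).

```python
import numpy as np

def fit_flops_to_loss(flops, loss):
    """Stage 1 (assumed form): fit L = a * C**b by least squares in log-log space."""
    b, log_a = np.polyfit(np.log(flops), np.log(loss), 1)
    return np.exp(log_a), b

def fit_loss_to_performance(loss, perf):
    """Stage 2 (assumed form): fit P = w0 + w1 * L with a straight line."""
    w1, w0 = np.polyfit(loss, perf, 1)
    return w0, w1

# Hypothetical, synthetic measurements from small sampling models.
sample_flops = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
sample_loss = 8.0 * sample_flops ** -0.03

# Hypothetical checkpoints whose task accuracy is already above chance.
ckpt_loss = np.array([2.6, 2.4, 2.2, 2.0])
ckpt_acc = 1.5 - 0.5 * ckpt_loss

a, b = fit_flops_to_loss(sample_flops, sample_loss)
w0, w1 = fit_loss_to_performance(ckpt_loss, ckpt_acc)

# Chain the two fits: compute budget -> predicted loss -> predicted accuracy.
target_flops = 1e21
predicted_loss = a * target_flops ** b
predicted_acc = w0 + w1 * predicted_loss
print(f"predicted loss={predicted_loss:.3f}, predicted accuracy={predicted_acc:.3f}")
```

The point of the chaining is that the loss, unlike task accuracy, changes smoothly with compute, so only the second fit needs data from models that have already moved off the chance-level plateau.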

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01·Rating 3·Confidence 4

Strengths

- The paper tackles an important problem of building scaling laws to measure the downstream task performance, especially when we know that task-specific behaviour emerges at different scales and smaller scale LMs might not be able to accurately capture the predictive behaviour of larger models on certain tasks. The paper's two-stage approach of separating the FLOPs $\rightarrow$ Loss and Loss $\rightarrow$ Performance predictive models circumvents the emergent behaviour issue with the FLOPs $\ri

Weaknesses

- The sharp transition in TriviaQA performance from the 1B to the 3B model highlights the brittleness of the approach: error margins can be huge for downstream task performance prediction, and this behaviour is very hard to characterize across the whole range of tasks usually used to compare LMs.
- I don't agree with the authors' point on enhancing sample efficiency by collecting losses corresponding to intermediate checkpoints; this actually creates a biased estimator for the

Reviewer 02·Rating 3·Confidence 3

Strengths

A notable strength of the paper is the quality of the writing: the narrative is clear, and the experiments are thorough. In addition, the FLP-M method accurately predicts performance based on the loss on data from different domains, which improves prediction accuracy in mixed-data scenarios. Figure 6 also demonstrates that FLP-M can be used to derive the optimal data mixing ratio for training.

Weaknesses

1. The authors utilize intermediate checkpoints to gather data points; however, for the same amount of FLOPs, models with different N (parameters) and D (data) would yield distinct losses. This raises a critical question: why is it valid to use checkpoints that have not converged and are not optimized configurations to obtain data points?
2. The second drawback is a lack of novelty. Both using FLOPs to predict loss and using loss to predict downstream performance have been explored in prior work
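Reviewer 02's first point can be illustrated with a small sketch. It assumes the common approximation C ≈ 6·N·D for training FLOPs and a Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β. Neither the formula nor the constants come from this paper; the constants are illustrative placeholders and the function names are hypothetical.

```python
import math

def approx_flops(n_params, n_tokens):
    """Common approximation for training compute: C ~= 6 * N * D (assumed here)."""
    return 6.0 * n_params * n_tokens

def parametric_loss(n_params, n_tokens, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Chinchilla-style loss form; all constants are illustrative placeholders."""
    return E + A / n_params ** alpha + B / n_tokens ** beta

# Two hypothetical configurations with the same compute budget.
config_a = (1e9, 20e9)  # 1B parameters, 20B tokens
config_b = (4e9, 5e9)   # 4B parameters, 5B tokens

flops_a = approx_flops(*config_a)
flops_b = approx_flops(*config_b)
loss_a = parametric_loss(*config_a)
loss_b = parametric_loss(*config_b)

print(f"A: FLOPs={flops_a:.2e}, loss={loss_a:.3f}")
print(f"B: FLOPs={flops_b:.2e}, loss={loss_b:.3f}")
```

Under these assumptions the two configurations spend the same compute but reach different losses, which is why a single FLOPs value does not determine a unique point on a FLOPs → Loss curve.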

Reviewer 03·Rating 6·Confidence 2

Strengths

1. **Practical Application Value** This paper introduces FLP-M, linking computational resources with LLM downstream performance. This research holds significant importance for real-world applications.

Weaknesses

1. **Limited Scale of LM** The largest model used in this paper is only 7B, yet there are many LLMs much larger than 7B (e.g., Llama-3 70B, Llama-3 405B). From this perspective, the conclusions of this paper are limited.
2. **Limited Domains in Data Mixing** As stated in the limitations, this paper only considers the domains of text and code under Data Mixing settings. Including more domains would enhance the explanatory power of the conclusions.

Reviewer 04·Rating 5·Confidence 5

Strengths

1. The paper identifies the issue of discontinuous performance when models approach the emergent edge, which is difficult to address with classical scaling laws, and proposes to resolve it with a continuous variable: the loss.
2. FLP creates more data points for fitting the scaling law, potentially making the fitted curve more generalizable.
3. FLP-M is introduced for data mixtures, providing a more accurate prediction by considering the different impacts of code and general text on down

Weaknesses

1. In section 3.2 Loss->Performance, there is a strong assumption that loss and accuracy have a linear relationship. Firstly, in all generative tasks shown in Figure 9, the linear relationship between loss and metric is not evident. The authors should provide more explicit statistical indicators to prove this linear correlation. Additionally, in the classification tasks shown in Figure 9, the relationship between loss and accuracy also encounters deviations near the emergent point, indicating th

Taxonomy

Topics·AI and HR Technologies

Methods·Focus