Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective

Chengyin Xu; Kaiyuan Chen; Xiao Li; Ke Shen; Chenggang Li

arXiv:2502.17262·cs.CL·March 10, 2026

TL;DR

This paper introduces a clustering-based framework called COD to improve the prediction of downstream task performance in large language models, addressing variability and emergence phenomena during scaling.

Contribution

The paper proposes a novel clustering-on-difficulty framework that enhances performance prediction accuracy and reliability for large language models during scaling.

Findings

01

COD achieves 1.55% average prediction error on benchmarks

02

Clusters tasks by difficulty to stabilize performance predictions

03

Mapping functions accurately extrapolate subset performance to full set

Abstract

The escalating scale and cost of Large Language Model (LLM) training necessitate accurate pre-training prediction of downstream task performance for a comprehensive understanding of scaling properties. This is challenged by: 1) the emergence phenomenon, where unpredictable capabilities appear suddenly at critical model scales; and 2) uneven task difficulty and inconsistent performance scaling patterns, leading to high metric variability. Current prediction methods lack accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. The COD framework clusters tasks by their difficulty scaling features, thereby constructing a more stable and predictable task subset that exhibits well-behaved scaling characteristics as the compute budget increases. We adopt a performance scaling law to predict cluster-wise performance with…
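The clustering and cluster-wise prediction steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the greedy radius-based grouping stands in for the paper's clustering method, the logit-linear fit in log-compute stands in for its performance scaling law, and every name here (`cluster_by_difficulty`, `predict_subset_accuracy`, `radius`, `min_size`) is hypothetical. Each sample is assumed to be described by its pass rates on a few small models, ordered by training compute.

```python
import math

def cluster_by_difficulty(features, radius, min_size):
    """Greedy radius-based grouping of per-sample pass-rate curves.

    Assumption: a simple stand-in for the paper's clustering step.
    Clusters with fewer than min_size members are discarded.
    """
    clusters = []  # each cluster: {"center": mean curve, "members": sample indices}
    for idx, vec in enumerate(features):
        for c in clusters:
            if math.dist(vec, c["center"]) <= radius:
                c["members"].append(idx)
                n = len(c["members"])
                # Running mean keeps the center equal to the members' average curve.
                c["center"] = [(m * (n - 1) + v) / n for m, v in zip(c["center"], vec)]
                break
        else:
            clusters.append({"center": list(vec), "members": [idx]})
    return [c for c in clusters if len(c["members"]) >= min_size]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1 - eps)  # clamp so the log stays finite
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict_subset_accuracy(features, log_compute, target_log_compute,
                            radius=0.3, min_size=2):
    """Extrapolate each kept cluster's mean curve, then average by cluster size.

    Assumption: logit(accuracy) is linear in log-compute within a cluster.
    """
    kept = cluster_by_difficulty(features, radius, min_size)
    if not kept:
        raise ValueError("no cluster met the minimum size")
    total = sum(len(c["members"]) for c in kept)
    pred = 0.0
    for c in kept:
        slope, intercept = fit_line(log_compute, [logit(p) for p in c["center"]])
        pred += len(c["members"]) / total * sigmoid(slope * target_log_compute + intercept)
    return pred, kept

# Toy data: two easy samples, two hard samples, one erratic outlier.
features = [
    [0.60, 0.70, 0.80],
    [0.10, 0.20, 0.30],
    [0.62, 0.72, 0.78],
    [0.12, 0.18, 0.32],
    [0.90, 0.10, 0.50],
]
pred, kept = predict_subset_accuracy(features, [18.0, 19.0, 20.0], 21.0)
print(len(kept), round(pred, 3))
```

In this toy run the erratic sample forms a singleton cluster and is dropped, so the prediction covers only the predictable subset. Extending that subset prediction to the full benchmark would need the separate mapping step that the paper describes, which this sketch omits.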

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The COD approach provides a clear decomposition of the performance-prediction problem into clustering, cluster-wise fitting, subset extrapolation, and subset-to-full mapping. This layered design could handle emergent and heterogeneous behaviors better.
2. The improved MeanShift with radius and minimum-cluster constraints balances automatic cluster discovery and intra-cluster homogeneity, outperforming K-Means, DBSCAN, and baseline MeanShift on both intra-cluster metrics and prediction error.

Weaknesses

1. COD introduces multiple stages (feature extraction, clustering, extrapolation, mapping) with several hyperparameters (radius R, min cluster K, thresholds a/b/c, polynomial degree). This may hinder practical adoption and stability.
2. At what error rate would the performance prediction be useful? Can it reflect the actual final benchmark performance?
3. In the introduction the error rate is 1.36%, but in the table of experimental results the error rate seems to be 1.63, which is

Reviewer 02 · Rating 8 · Confidence 5

Strengths

- The method is effective, well justified, and works across a large set of tasks.
- The empirical results are very convincing.
- The paper includes extensive ablations in the main text and appendix.
- The paper is an important contribution to the field of estimating downstream performance, as it significantly improves the ability to predict downstream performance.

Weaknesses

- The core of the clustering approach seems to hinge on estimating the difficulty of each instance in a benchmark. To do so, the paper seems to imply that the models used are the same ones targeted by the scaling laws. It is unclear whether the proposed approach would work with a different model family (i.e., how would this approach work if one has to bootstrap from external open-weights LLMs?)
- To estimate task difficulty, it seems that one needs a model already trained at the largest target size to build clus

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. Directly predicting downstream performance (not just loss) is operationally valuable.
2. The authors invest considerable effort in training nine models (122M→70B) under a consistent recipe and evaluating eight diverse benchmarks. Such results are valuable for the community.
3. The paper is well-written and easy to understand.

Weaknesses

1. The generalizability of the model is questionable. All selected benchmarks are widely adopted in the community. How does the proposed method generalize to unseen benchmarks?
2. The authors compare only with their own baselines. Many existing methods for predicting performance on downstream benchmarks are missing from the comparison.
3. The COD experiments rely heavily on heuristic hyperparameters such as a, b, and c. This limits generalizability to other cases.
4. Typos: in the abstract, the mean error is 1.36%
