Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective
Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, Chenggang Li

TL;DR
This paper introduces a clustering-based framework called COD to improve the prediction of downstream task performance in large language models, addressing variability and emergence phenomena during scaling.
Contribution
The paper proposes a novel clustering-on-difficulty framework that enhances performance prediction accuracy and reliability for large language models during scaling.
Findings
COD achieves 1.55% average prediction error on benchmarks
Clusters tasks by difficulty to stabilize performance predictions
Mapping functions accurately extrapolate subset performance to full set
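The subset-to-full mapping named in the last finding can be sketched as a small regression problem. This summary does not give the paper's actual mapping function, so the sketch below assumes a degree-1 least-squares fit purely for simplicity. The function name `fit_linear` and all accuracy values are hypothetical, not results from the paper.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative numbers: accuracy of three small models on the
# predictable subset and on the full benchmark.
subset_acc = [0.20, 0.40, 0.60]
full_acc = [0.15, 0.30, 0.45]

slope, intercept = fit_linear(subset_acc, full_acc)

# Map an extrapolated subset accuracy (also made up) to a full-set estimate.
predicted_subset_acc = 0.80
predicted_full_acc = slope * predicted_subset_acc + intercept
print(round(predicted_full_acc, 4))
```

A higher-degree polynomial could be swapped in for `fit_linear` without changing the surrounding steps. Which form fits best would depend on the benchmark.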
Abstract
The escalating scale and cost of Large Language Model (LLM) training necessitate accurate pre-training prediction of downstream task performance for a comprehensive understanding of scaling properties. This is challenged by: 1) the emergence phenomenon, where unpredictable capabilities appear suddenly at critical model scales; and 2) uneven task difficulty and inconsistent performance scaling patterns, leading to high metric variability. Current prediction methods lack accuracy and reliability. We propose a Clustering-On-Difficulty (COD) framework for downstream performance prediction. The COD framework clusters tasks by their difficulty scaling features, thereby constructing a more stable and predictable task subset that exhibits well-behaved scaling characteristics as the compute budget increases. We adopt a performance scaling law to predict cluster-wise performance with…
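As a rough illustration of clustering tasks by difficulty scaling features, the sketch below groups tasks whose pass rates evolve similarly across model scales. It uses a simple greedy radius-based grouping with a minimum cluster size, which is a stand-in and not the paper's improved MeanShift. The feature representation (one pass rate per model scale), the function name `cluster_by_difficulty`, and all values are assumptions made for this example.

```python
from math import dist


def cluster_by_difficulty(features, radius, min_size):
    """Greedy radius-based grouping of per-task feature vectors.

    features: dict of task id -> tuple of pass rates, one per model scale
              (hypothetical representation, smallest model first).
    Returns (clusters, outliers), where clusters is a list of task-id lists
    and outliers collects tasks from groups smaller than min_size.
    """
    centers = []  # running centroid of each group
    members = []  # task ids in each group
    for task_id, vec in features.items():
        # Pick the nearest centroid that lies within the radius, if any.
        best, best_d = None, radius
        for i, center in enumerate(centers):
            d = dist(vec, center)
            if d <= best_d:
                best, best_d = i, d
        if best is None:
            centers.append(tuple(vec))
            members.append([task_id])
        else:
            members[best].append(task_id)
            n = len(members[best])
            # Incremental mean update of the centroid.
            centers[best] = tuple(
                c + (v - c) / n for c, v in zip(centers[best], vec)
            )
    clusters = [m for m in members if len(m) >= min_size]
    outliers = [t for m in members if len(m) < min_size for t in m]
    return clusters, outliers


# Made-up pass rates at three model scales (small, medium, large).
features = {
    "easy_1": (0.60, 0.80, 0.90),
    "easy_2": (0.62, 0.78, 0.92),
    "hard_1": (0.00, 0.05, 0.30),
    "hard_2": (0.02, 0.04, 0.28),
    "odd": (0.90, 0.50, 0.10),  # gets worse with scale, unlike the rest
}
clusters, outliers = cluster_by_difficulty(features, radius=0.1, min_size=2)
print(clusters)
print(outliers)
```

In this toy run the task with the non-monotonic pattern ends up in `outliers`, which mirrors the idea of keeping only a well-behaved, predictable subset for scaling-law fitting.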
Peer Reviews
Decision·ICLR 2026 Poster
1. The COD approach provides a clear decomposition of the performance-prediction problem into clustering, cluster-wise fitting, subset extrapolation, and subset-to-full mapping. This layered design could handle emergent and heterogeneous behaviors better. 2. The improved MeanShift with radius and minimum-cluster constraints balances automatic cluster discovery and intra-cluster homogeneity, outperforming K-Means, DBSCAN, and baseline MeanShift on both intra-cluster metrics and prediction error.
1. COD introduces multiple stages (feature extraction, clustering, extrapolation, mapping) with several hyperparameters (radius R, min cluster K, thresholds a/b/c, polynomial degree). This may hinder practical adoption and stability. 2. At what error rate would the performance prediction still be useful? Can it reflect the actual final benchmark performance? 3. In the introduction the error rate is 1.36%, but in the table of experimental results, the error rate seems to be 1.63, which is
- The method is effective, well justified, and works across a large set of tasks. - The empirical results are very convincing. - The paper includes extensive ablations in the main text and appendix. - The paper is an important contribution to the field of estimating downstream performance, as it significantly improves the ability to predict downstream performance.
- The core of the clustering approach seems to hinge on estimating the difficulty of each instance in a benchmark. To do so, the paper seems to imply that the models that are themselves the target of the scaling laws are used. It is unclear whether the proposed approach would work with a different model family (i.e., how would this approach work if one has to bootstrap from external open-weights LLMs?) - To estimate task difficulty, it seems that one needs a model already trained at the largest target size to build clus
1. Directly predicting downstream performance (not just loss) is operationally valuable. 2. The authors put considerable effort into training nine models (122M→70B) under a consistent recipe and evaluating them on eight diverse benchmarks. Such results are valuable for the community. 3. The paper is well-written and easy to understand.
1. The generalizability of the model is questionable. All selected benchmarks are widely adopted in the community. How does the proposed method generalize to unseen benchmarks? 2. The authors only compare with their own baselines. Many performance prediction solutions on downstream benchmarks are missing in the comparison. 3. The COD experiments heavily rely on heuristic hyperparameters such as a, b & c. This limits its generalizability to other cases. 4. Typos: in the abstract, the mean error is 1.36%
