SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization
Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, Jindong Wang

TL;DR
SparseEval introduces a novel sparse optimization approach using gradient descent to efficiently evaluate large language models, significantly reducing computational costs while maintaining high accuracy across benchmarks.
Contribution
It is the first method to apply gradient-based sparse optimization with iterative anchor refinement for LLM evaluation, improving efficiency and robustness.
Findings
Low estimation error across benchmarks
High Kendall's τ indicating strong correlation with full evaluations
Effective in real-world scenarios with reduced computational costs
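The two quantities named in these findings can be made concrete with a small sketch. It assumes that "estimation error" means the mean absolute gap between estimated and full-benchmark scores and uses the tau-a variant of Kendall's τ; the paper's exact definitions may differ. The scores are hypothetical illustrative values, not results from the paper.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    assert len(x) == len(y) and len(x) >= 2
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_abs_error(estimated, true):
    """Average absolute gap between estimated and full-benchmark scores."""
    return sum(abs(e - t) for e, t in zip(estimated, true)) / len(true)


# Hypothetical accuracies for four models (illustrative only).
full_scores = [0.62, 0.71, 0.55, 0.80]       # scores on the full benchmark
estimated_scores = [0.60, 0.73, 0.58, 0.79]  # scores estimated from anchors

tau = kendall_tau(estimated_scores, full_scores)
mae = mean_abs_error(estimated_scores, full_scores)
print(f"Kendall tau = {tau:.2f}, mean absolute error = {mae:.3f}")
```

Here the estimates preserve the ranking of all four models, so τ is 1.0, while the mean absolute error is about 0.02. A method can have a small error and still misrank closely matched models, which is why both numbers are reported.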
Abstract
As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to…
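The core idea of fitting anchor weights by gradient descent can be sketched as follows. This is a deliberately simplified illustration, not the authors' implementation: it uses a linear aggregator instead of the MLP, keeps the anchor set fixed instead of refining it with the Anchor Importance Score and Candidate Importance Score, and runs on a synthetic model-item matrix. All function names, the anchor indices, and the hyperparameters are assumptions.

```python
import random


def make_matrix(n_models, n_items, rng):
    """Synthetic 0/1 model-item matrix (toy assumption): a model solves an
    item when its ability exceeds the item's difficulty plus noise."""
    abilities = [rng.random() for _ in range(n_models)]
    difficulties = [rng.random() for _ in range(n_items)]
    return [[1.0 if a > d + rng.gauss(0, 0.1) else 0.0 for d in difficulties]
            for a in abilities]


def predict(weights, bias, anchor_scores):
    """Weighted sum of a model's anchor results, plus a bias term."""
    return bias + sum(w * s for w, s in zip(weights, anchor_scores))


def mse(weights, bias, features, targets):
    return sum((predict(weights, bias, f) - t) ** 2
               for f, t in zip(features, targets)) / len(targets)


def fit_anchor_weights(features, targets, lr=0.05, steps=2000):
    """Plain batch gradient descent on the mean squared error."""
    k, n = len(features[0]), len(targets)
    weights, bias = [0.0] * k, 0.0
    for _ in range(steps):
        errors = [predict(weights, bias, f) - t
                  for f, t in zip(features, targets)]
        grad_w = [2.0 / n * sum(e * f[j] for e, f in zip(errors, features))
                  for j in range(k)]
        grad_b = 2.0 / n * sum(errors)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


rng = random.Random(0)
matrix = make_matrix(n_models=30, n_items=40, rng=rng)
anchors = [0, 8, 16, 24, 32]  # hypothetical fixed anchor items
features = [[row[i] for i in anchors] for row in matrix]
targets = [sum(row) / len(row) for row in matrix]  # full-benchmark accuracy

# Fit on 24 "historical" models, then estimate 6 held-out models
# from their anchor results alone.
train_x, train_y = features[:24], targets[:24]
test_x, test_y = features[24:], targets[24:]

loss_before = mse([0.0] * len(anchors), 0.0, train_x, train_y)
weights, bias = fit_anchor_weights(train_x, train_y)
loss_after = mse(weights, bias, train_x, train_y)
test_error = sum(abs(predict(weights, bias, f) - t)
                 for f, t in zip(test_x, test_y)) / len(test_y)
print(f"train MSE {loss_before:.4f} -> {loss_after:.4f}, "
      f"held-out mean abs error {test_error:.4f}")
```

In this sketch a held-out model needs inference on only 5 of the 40 items. According to the abstract, the full method goes further by using an MLP as the aggregator and by iteratively refining which items serve as anchors.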
Peer Reviews
Decision: ICLR 2026 Poster
Casting efficient LLM evaluation as a sparse optimization problem is conceptually clean and original. The gradient-based anchor refinement (AIS/CIS) is simple, intuitive, and empirically strong. Results across multiple datasets and ablations convincingly show superior efficiency and robustness. The approach can substantially reduce evaluation cost in large-scale LLM benchmarking.
The proposed method relies on access to a large model–item performance matrix for training and anchor refinement. Its effectiveness when only a small number of model evaluations are available, or when evaluating a completely new task with limited historical data, has not been examined. This restricts its practical usability in real-world cold-start settings. While the paper frames efficient evaluation as a sparse optimization problem, the theoretical analysis remains shallow. The notion of spar
* This paper demonstrates the sparsity in the dataset, which enables the prediction of model performance using a small amount of anchor data.
* The paper proposes SparseEval, which trains an MLP to make predictions based on the performance of existing models on both anchor data and the full dataset.
* Experimental results show that this method can accurately select models that are representative of the entire dataset.
* Given a new dataset, this method may require testing many models and then training on the results, which can be costly in practice.
* The approach lacks generalization to stronger models and other architectures. Under the current training setup, it remains unclear whether the method can generalize to more powerful models or to new architectures such as linear attention or MoE models.
* The method lacks evaluation of its adaptability under long-chain-of-thought (long-CoT) conditions, such
The paper offers a clear modeling perspective by framing evaluation sparsity through the model-by-item score matrix and performing sparse selection with learned weights directly aligned to overall scores and rankings. It introduces task-aware anchor optimization via AIS and CIS to iteratively refine anchors beyond one-shot selection, and employs a nonlinear MLP aggregator to capture inter-item structure and model-family differences. Empirically, it foregrounds practical gains, reporting low erro
1. Do your learned anchors (and the associated aggregation/weights) directly generalize to unseen models? In other words, can we train anchors once and apply them to new model releases without retraining?
2. What is the end-to-end cost of discovering/learning the anchor set (including any proxy-model runs, scoring, and training)? If this cost is substantial, and the resulting anchors/weights only work well for a single model, then the approach may not be worthwhile. Please quantify the cost and
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Sentiment Analysis and Opinion Mining
