SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization

Taolin Zhang; Hang Guo; Wang Lu; Tao Dai; Shu-Tao Xia; Jindong Wang

arXiv:2602.07909 · cs.CL · February 10, 2026

TL;DR

SparseEval formulates efficient LLM evaluation as a sparse optimization problem: it uses gradient descent to learn weights over a small set of anchor items, reducing the inference cost of benchmarking while keeping estimates close to full-evaluation results.

Contribution

The authors present SparseEval as the first method to optimize anchor weights by gradient descent and to refine the anchor set iteratively for LLM evaluation, with the aim of improving efficiency and robustness.

Findings

01. Low estimation error across benchmarks

02. High Kendall's τ, indicating strong rank agreement with full evaluations

03. Effective in real-world scenarios with reduced computational costs
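
Kendall's τ in the second finding measures how well the model ranking from an efficient evaluation agrees with the ranking from the full benchmark. The following is a minimal sketch of the tau-a variant in plain Python. The function name `kendall_tau_a` and the score lists are hypothetical illustrations, not values or code from the paper, and the paper may use a tie-corrected variant.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # The sign of the product tells whether the pair is ordered
        # the same way in both score lists.
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # Tied pairs (s == 0) count toward neither total in tau-a.
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical accuracies for four models (illustrative numbers only).
full_scores = [0.81, 0.64, 0.72, 0.55]       # full-benchmark evaluation
estimated_scores = [0.79, 0.66, 0.70, 0.58]  # anchor-based estimate
tau = kendall_tau_a(full_scores, estimated_scores)
```

A τ of 1 means the two evaluations rank every pair of models identically, and a τ of -1 means the rankings are fully reversed.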

Abstract

As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to…
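
The idea of learning anchor weights by gradient descent can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it swaps the MLP for a linear model with a bias term, fixes the anchor set instead of refining it, and uses a made-up binary model-item matrix. The names `predict`, `fit_anchor_weights`, and `mse` are hypothetical.

```python
def predict(anchor_scores, w, b):
    """Estimate a model's full-benchmark score from its anchor-item scores."""
    return sum(wk * xk for wk, xk in zip(w, anchor_scores)) + b


def fit_anchor_weights(matrix, anchors, lr=0.1, steps=2000):
    """Fit w and b by plain gradient descent on mean squared error, so that
    predict(anchor scores) approximates each model's mean score over all items."""
    targets = [sum(row) / len(row) for row in matrix]
    feats = [[row[a] for a in anchors] for row in matrix]
    n = len(matrix)
    w = [0.0] * len(anchors)
    b = 0.0
    for _ in range(steps):
        errs = [predict(x, w, b) - t for x, t in zip(feats, targets)]
        grad_w = [2.0 * sum(e * x[k] for e, x in zip(errs, feats)) / n
                  for k in range(len(anchors))]
        grad_b = 2.0 * sum(errs) / n
        w = [wk - lr * g for wk, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def mse(matrix, anchors, w, b):
    """Mean squared error between predicted and true mean scores."""
    total = 0.0
    for row in matrix:
        target = sum(row) / len(row)
        total += (predict([row[a] for a in anchors], w, b) - target) ** 2
    return total / len(matrix)


# Made-up model-item matrix: rows are models, columns are benchmark items,
# and 1 means the model answered the item correctly.
matrix = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
]
anchors = [1, 3]  # hypothetical anchor items, chosen by hand for this sketch
w, b = fit_anchor_weights(matrix, anchors)
fitted_error = mse(matrix, anchors, w, b)
```

Once `w` and `b` are fitted on previously evaluated models, a new model would only need to be run on the anchor items, and `predict` would give an estimate of its full-benchmark score. How well that estimate transfers to unseen models is exactly the question the reviewers raise.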

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

Casting efficient LLM evaluation as a sparse optimization problem is conceptually clean and original. The gradient-based anchor refinement (AIS/CIS) is simple, intuitive, and empirically strong. Results across multiple datasets and ablations convincingly show superior efficiency and robustness. The approach can substantially reduce evaluation cost in large-scale LLM benchmarking.

Weaknesses

The proposed method relies on access to a large model–item performance matrix for training and anchor refinement. Its effectiveness when only a small number of model evaluations are available, or when evaluating a completely new task with limited historical data, has not been examined. This restricts its practical usability in real-world cold-start settings.

While the paper frames efficient evaluation as a sparse optimization problem, the theoretical analysis remains shallow. The notion of spar

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* This paper demonstrates the sparsity in the dataset, which enables the prediction of model performance using a small amount of anchor data.
* The paper proposes SparseEval, which trains an MLP to make predictions based on the performance of existing models on both anchor data and the full dataset.
* Experimental results show that this method can accurately select models that are representative of the entire dataset.

Weaknesses

* Given a new dataset, this method may require testing many models and performing training, which can be costly in practice.
* The approach lacks generalization to stronger models and other architectures. Under the current training setup, it remains unclear whether the architecture can generalize to more powerful models or to new architectures such as linear attention or MoE models.
* The method lacks evaluation of its adaptability under long-chain-of-thought (long-CoT) conditions, such

Reviewer 03 · Rating 6 · Confidence 1

Strengths

The paper offers a clear modeling perspective by framing evaluation sparsity through the model-by-item score matrix and performing sparse selection with learned weights directly aligned to overall scores and rankings. It introduces task-aware anchor optimization via AIS and CIS to iteratively refine anchors beyond one-shot selection, and employs a nonlinear MLP aggregator to capture inter-item structure and model family differences. Empirically, it foregrounds practical gains, reporting low erro

Weaknesses

1. Do your learned anchors (and the associated aggregation/weights) directly generalize to unseen models? In other words, can we train anchors once and apply them to new model releases without retraining?
2. What is the end-to-end cost of discovering/learning the anchor set (including any proxy-model runs, scoring, and training)? If this cost is substantial, and the resulting anchors/weights only work well for a single model, then the approach may not be worthwhile. Please quantify the cost and

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Sentiment Analysis and Opinion Mining