
TL;DR
This paper introduces a submodular optimization framework for selecting a small, representative subset of benchmarks to evaluate large language models efficiently, leveraging entropy and mutual information.
Contribution
It formalizes benchmark selection as submodular maximization under a multivariate Gaussian model and studies two selection objectives: the entropy of the selected benchmarks and the mutual information between selected and remaining benchmarks.
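For reference, the two objectives can be written out using the standard Gaussian identities. The notation here is an assumption made for illustration and is not taken from the paper: $V$ is the full set of benchmarks, $S \subseteq V$ the selected subset, and $\Sigma$ the covariance matrix of benchmark scores.

```latex
H(X_S) = \tfrac{1}{2}\,\log\det\!\big(2\pi e\,\Sigma_{SS}\big),
\qquad
I\big(X_S;\,X_{V\setminus S}\big) = H(X_S) + H(X_{V\setminus S}) - H(X_V).
```

Up to constants, maximizing $H(X_S)$ is the same as maximizing $\log\det\Sigma_{SS}$, which is the log-determinant objective named in the abstract.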
Findings
Mutual information selection outperforms entropy selection for imputation when the selected subset is small.
Entropy selection coincides with pivoted Cholesky and admits spectral residual bounds.
Experiments use three matrices drawn from ten public leaderboards.
Abstract
Evaluating large language models across many benchmarks is expensive, yet many benchmarks are highly correlated. We formalize the selection of a small, informative subset as submodular maximization under a multivariate Gaussian model. Entropy (log-determinant covariance) and mutual information between selected and remaining benchmarks arise as natural objectives. Both are submodular; entropy selection coincides with pivoted Cholesky and has spectral residual bounds, while mutual information is non-monotone in general but empirically monotone for small subsets, so we optimize it greedily. Experiments on three matrices from ten public leaderboards show that mutual information selection outperforms entropy for imputation at small subsets.
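The two greedy procedures described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a positive-definite covariance matrix `cov` of benchmark scores is already available, and the function names `greedy_entropy`, `gaussian_mi`, and `greedy_mi` are hypothetical. The mutual information routine recomputes the objective from scratch for every candidate, which is simple but slow for many benchmarks.

```python
import numpy as np


def greedy_entropy(cov, k):
    """Greedy log-det maximization over k indices.

    Each step picks the index with the largest conditional variance given
    the indices chosen so far, which is the pivot rule of pivoted Cholesky.
    """
    n = cov.shape[0]
    resid = np.array(cov, dtype=float)  # Schur complement, updated in place
    chosen = np.zeros(n, dtype=bool)
    selected = []
    for _ in range(k):
        cond_var = np.where(chosen, -np.inf, np.diag(resid))
        j = int(np.argmax(cond_var))
        selected.append(j)
        chosen[j] = True
        col = resid[:, j] / np.sqrt(resid[j, j])
        resid = resid - np.outer(col, col)  # rank-one Cholesky downdate
    return selected


def gaussian_mi(cov, subset):
    """I(X_S; X_rest) for a Gaussian with positive-definite covariance cov."""
    n = cov.shape[0]
    rest = [i for i in range(n) if i not in subset]
    if not subset or not rest:
        return 0.0

    def logdet(idx):
        return np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]

    return 0.5 * (logdet(list(subset)) + logdet(rest) - logdet(list(range(n))))


def greedy_mi(cov, k):
    """Greedy mutual-information selection by brute-force re-evaluation."""
    n = cov.shape[0]
    selected = []
    for _ in range(k):
        candidates = [j for j in range(n) if j not in selected]
        best = max(candidates, key=lambda j: gaussian_mi(cov, selected + [j]))
        selected.append(best)
    return selected


# Toy example (made-up numbers): benchmarks 0 and 1 are highly correlated,
# benchmark 2 is independent of both.
toy_cov = np.array([[1.0, 0.9, 0.0],
                    [0.9, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])
print(greedy_entropy(toy_cov, 2))
print(greedy_mi(toy_cov, 1))
```

On the toy matrix the two criteria behave differently: entropy selection follows one of the correlated benchmarks with the independent one, since that is where the remaining variance lies, while mutual information selection starts with a benchmark that is informative about the unselected rest.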
