Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs
Ganghua Wang, Zhaorun Chen, Bo Li, Haifeng Xu

TL;DR
Cer-Eval is a novel evaluation framework for large language models that adaptively selects test samples to reduce evaluation costs by 20-40% while ensuring high-confidence performance estimates.
Contribution
It introduces a certifiable, cost-efficient evaluation method with theoretical bounds and an adaptive algorithm for test sample selection in LLM assessment.
Findings
Reduces test sample requirements by 20-40%.
Maintains high-confidence evaluation accuracy.
Provides theoretical bounds on test sample complexity.
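As a rough illustration of what a test sample complexity bound looks like, the sketch below uses the classical Hoeffding inequality for per-sample scores bounded in [0, 1]. This is a textbook baseline under an i.i.d. assumption, not the bound derived in the paper, and the function names `hoeffding_sample_size` and `hoeffding_interval` are hypothetical.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest n such that the sample mean of i.i.d. scores in [0, 1]
    is within epsilon of the true mean with probability >= 1 - delta
    (two-sided Hoeffding bound). Illustrative baseline only."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


def hoeffding_interval(scores, delta: float):
    """Two-sided (1 - delta) confidence interval for the mean of
    i.i.d. scores in [0, 1], clipped to [0, 1]."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    mean = sum(scores) / n
    half_width = math.sqrt(math.log(2 / delta) / (2 * n))
    return max(0.0, mean - half_width), min(1.0, mean + half_width)


if __name__ == "__main__":
    # Samples needed for a +/- 0.05 estimate at 95% confidence.
    print(hoeffding_sample_size(epsilon=0.05, delta=0.05))
    # Interval for 80 correct answers out of 100.
    print(hoeffding_interval([1] * 80 + [0] * 20, delta=0.05))
```

With epsilon = 0.05 and delta = 0.05 this baseline asks for 738 samples regardless of the model or dataset. An adaptive method of the kind summarized above aims to need fewer samples than such a worst-case count.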
Abstract
As foundation models continue to scale, the size of trained models grows exponentially, presenting significant challenges for their evaluation. Current evaluation practices involve curating increasingly large datasets to assess the performance of large language models (LLMs). However, there is a lack of systematic analysis and guidance on determining the sufficiency of test data or selecting informative samples for evaluation. This paper introduces a certifiable and cost-efficient evaluation framework for LLMs. Our framework adapts to different evaluation objectives and outputs confidence intervals that contain true values with high probability. We use "test sample complexity" to quantify the number of test points needed for a certifiable evaluation and derive tight bounds on test sample complexity. Based on the developed theory, we develop a partition-based algorithm, named Cer-Eval,…
Peer Reviews
Decision · Submitted to ICLR 2026
1. The problem is well defined and the formulation is well motivated. The paper addresses a genuinely critical problem. The problem is well known, yet many papers (from both industry and academia) still use static evaluation, which is sample-inefficient.
2. The paper lays a strong theoretical foundation and introduces a principled approach to variance reduction. The authors have done a great analysis matching the upper and lower bounds on the test sample complexity, and Theorem 5.2 prov
1. My biggest concern for the paper is the critical gap in baseline comparisons. The paper effectively compares the method with only two baselines, i.e., static evaluation and the vanilla online evaluation process. Quite a few obvious papers that address the same problem are not included. For example, TinyBenchmarks (Polo et al., 2024) leverages Item Response Theory (IRT) and is able to achieve <2% estimation error based on 1% of the full MMLU dataset. Similarly, StratPPI (Fisch et al
- The proposed evaluation approach seems to reduce the number of test samples required to obtain a confident evaluation, which can be of great use to practitioners since LLM evaluations are typically computationally expensive.
- The proposed evaluation is principled and theoretically motivated.
- The paper provides extensive evaluations on a synthetic task and for four models on three "real-world" datasets, empirically validating the effectiveness of their proposed evaluation.
*Missing breakeven analysis.* The discussion in the paragraph starting line 888 is not sufficient. The paper would require a thorough discussion of the breakeven point, i.e., the point at which using this evaluation is better than evaluating on the entire dataset. I am not sufficiently convinced that the proposed approach is always preferable, as the user has to tolerate a moderate to large estimation error and/or the test dataset needs to be sufficiently large. I think the point where your app
- Addresses two underexplored but important problems: (i) certifiable / reliable, and (ii) efficient evaluation of LLMs.
- The authors provide theoretical motivation and formal guarantees for their framework.
- Empirical evidence is provided to show that this method can lead to practical efficiency gains on popular benchmarks.
- The authors provide relevant context regarding prior work in this area, specifically work concerned with efficient LLM evaluations.
- Ablation on the influence of the embe
- The experimental validation focuses mainly on accuracy and efficiency. I would have liked an evaluation of the robustness of the confidence intervals (e.g., with minor perturbations to the data or repeated experiments).
- It should be emphasized that these results only hold if the partition is i.i.d. with respect to the rest of the evaluation set, which might not always be true in practice (or under adversarial attacks / poisoning / etc.).
- No evaluation of the overhead of the algorithm, despi
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy
