Less is more: Not all samples are effective for evaluation
Wentang Song, Jinqiang Li, Kele Huang, Junhui Lin, Shengxiang Wu, Zhongshi Xie

TL;DR
This paper introduces a history-free test set compression method for LLM evaluation that significantly reduces computational costs by removing redundant samples without relying on prior model performance data.
Contribution
It proposes a novel domain-adapted embedding and clustering framework with a dataset X-ray mechanism to dynamically calibrate test set compression.
Findings
Reduces evaluation cost by over 90%
Maintains high fidelity to full benchmark results
Effective on large-scale professional-domain datasets
Abstract
The versatility of Large Language Models (LLMs) in vertical domains has spurred the development of numerous specialized evaluation benchmarks. However, these benchmarks often suffer from significant semantic redundancy and impose high computational costs during evaluation. Existing compression methods, such as tinyBenchmarks depend critically on correctness labels from multiple historical models evaluated on the full test set, making them inapplicable in cold-start scenarios, such as the introduction of a new task, domain, or model with no prior evaluation history. To address this limitation, we propose a history-free test set compression framework that requires no prior model performance data. Our method begins by fine-tuning a base LLM on a small amount of domain-specific data to internalize task-relevant semantics. It then generates high-level semantic embeddings for all original…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Domain Adaptation and Few-Shot Learning · Machine Learning in Healthcare
