Less is more: Not all samples are effective for evaluation

Wentang Song; Jinqiang Li; Kele Huang; Junhui Lin; Shengxiang Wu; Zhongshi Xie

arXiv:2601.03272·cs.CL·January 8, 2026

Open Access

TL;DR

This paper introduces a history-free test set compression method for LLM evaluation that significantly reduces computational costs by removing redundant samples without relying on prior model performance data.

Contribution

It proposes a novel domain-adapted embedding and clustering framework with a dataset X-ray mechanism to dynamically calibrate test set compression.

Findings

01. Reduces evaluation cost by over 90%
02. Maintains high fidelity to full benchmark results
03. Effective on large-scale professional-domain datasets

Abstract

The versatility of Large Language Models (LLMs) in vertical domains has spurred the development of numerous specialized evaluation benchmarks. However, these benchmarks often suffer from significant semantic redundancy and impose high computational costs during evaluation. Existing compression methods, such as tinyBenchmarks, depend critically on correctness labels from multiple historical models evaluated on the full test set, making them inapplicable in cold-start scenarios, for example when a new task, domain, or model is introduced with no prior evaluation history. To address this limitation, we propose a history-free test set compression framework that requires no prior model performance data. Our method begins by fine-tuning a base LLM on a small amount of domain-specific data to internalize task-relevant semantics. It then generates high-level semantic embeddings for all original…
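
To make the embed-then-cluster idea concrete, here is a minimal sketch under stated assumptions, not the authors' implementation. The toy 2-D vectors stand in for embeddings that would come from the domain-adapted LLM. The clustering algorithm (k-means with farthest-point initialization), the choice of the sample nearest each centroid as the cluster representative, and the cluster-share weighting are all illustrative assumptions. The function names `kmeans`, `compress`, and `estimate_accuracy` are hypothetical, and the dataset X-ray calibration step is not modeled.

```python
import math


def kmeans(points, k, iters=50):
    """Plain k-means with deterministic farthest-point initialization."""
    # Start from the first point, then repeatedly add the point that is
    # farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def compress(embeddings, k):
    """Return (sample_index, weight) pairs, one representative per cluster."""
    centroids, labels = kmeans(embeddings, k)
    selected = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue
        rep = min(members, key=lambda i: math.dist(embeddings[i], centroids[c]))
        selected.append((rep, len(members) / len(embeddings)))
    return selected


def estimate_accuracy(selected, is_correct):
    """Weighted accuracy computed on the representatives only."""
    return sum(weight * is_correct[i] for i, weight in selected)


# Toy stand-in embeddings: two groups of near-duplicate test samples.
embeddings = [
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
    [5.0, 5.1], [5.1, 5.0], [5.05, 5.05],
]
selected = compress(embeddings, k=2)

# Hypothetical per-sample results for one model (1 = correct, 0 = wrong).
is_correct = [1, 1, 1, 0, 0, 0]
estimate = estimate_accuracy(selected, is_correct)
full = sum(is_correct) / len(is_correct)
print(selected, estimate, full)
```

In this toy case only two of the six samples are evaluated, and the weighted estimate matches the full-set accuracy because each cluster is internally redundant. Note that no model correctness labels are used to choose the samples, which is what makes the selection history-free.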

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Machine Learning in Healthcare