BenTo: Benchmark Task Reduction with In-Context Transferability
Hongyu Zhao, Ming Li, Lichao Sun, Tianyi Zhou

TL;DR
This paper introduces BenTo, a method to efficiently reduce large language model benchmarks by selecting a representative subset of tasks based on transferability, significantly lowering evaluation costs with minimal accuracy loss.
Contribution
BenTo proposes a training-free, gradient-free task selection method based on transferability that reduces benchmark size to 5% of the original tasks with less than a 4% evaluation discrepancy.
Findings
Reduces benchmark tasks to 5% of original size
Maintains less than 4% difference in evaluation results
Method is training-free and highly efficient
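The findings above report a difference of less than 4% between the reduced and original evaluations. This page does not say how that difference is computed, so the sketch below rests on stated assumptions. The full-benchmark score is taken to be the mean per-task accuracy. The reduced benchmark estimates that score by weighting each selected task by the number of original tasks it represents. The discrepancy is the relative error between the two scores. The function names and the `assignment` format are hypothetical.

```python
from collections import Counter


def full_score(task_acc: dict[str, float]) -> float:
    """Mean accuracy over every task in the original benchmark."""
    return sum(task_acc.values()) / len(task_acc)


def reduced_score(task_acc: dict[str, float], assignment: dict[str, str]) -> float:
    """Estimate the full score using only the selected (representative) tasks.

    Assumption (hypothetical format): `assignment` maps each original task to the
    selected task that represents it. Each selected task is weighted by how many
    original tasks it covers, so only selected tasks' accuracies are read.
    """
    counts = Counter(assignment.values())
    return sum(task_acc[rep] * n for rep, n in counts.items()) / len(assignment)


def relative_discrepancy(full: float, reduced: float) -> float:
    """Relative error of the reduced-benchmark estimate (assumed metric)."""
    return abs(reduced - full) / full


if __name__ == "__main__":
    # Made-up accuracies for four tasks; "a" stands in for "b" and "c".
    acc = {"a": 0.8, "b": 0.7, "c": 0.6, "d": 0.5}
    assign = {"a": "a", "b": "a", "c": "a", "d": "d"}
    full = full_score(acc)
    est = reduced_score(acc, assign)
    print(f"full={full:.3f} reduced={est:.3f} diff={relative_discrepancy(full, est):.1%}")
```

With an identity assignment, where every task represents itself, the estimate equals the full score. The discrepancy therefore measures only the error introduced by letting one task stand in for others.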
Abstract
Evaluating large language models (LLMs) is costly: it requires the generation and examination of LLM outputs on a large-scale benchmark of various tasks. This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality. Our study reveals that task transferability and relevance provide critical information to identify the most representative subset of tasks via optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% while inducing only a <4% difference to the evaluation on the original benchmark. Compared to prior works, our method is training-free, gradient-free, and highly efficient, requiring only ICL.
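The abstract describes two steps. The first estimates pairwise transferability with ICL, and the second selects a representative subset by optimizing a facility location function. The sketch below covers both under assumptions that do not come from the paper:

* `icl_accuracy(source, target)` is a hypothetical callable. It returns the model's accuracy on `target` when the prompt contains few-shot demonstrations drawn from `source`.
* Each target column is shifted so its minimum is zero. This factors out per-task difficulty and keeps similarities nonnegative.
* The objective F(S) = Σ_j max_{i∈S} sim[i][j] is maximized with the standard greedy algorithm.

The paper's exact normalization and optimizer may differ.

```python
from typing import Callable, Sequence

Matrix = list[list[float]]


def transferability_matrix(
    tasks: Sequence[str], icl_accuracy: Callable[[str, str], float]
) -> Matrix:
    """T[i][j] = accuracy on tasks[j] with in-context demos drawn from tasks[i].

    `icl_accuracy` is a hypothetical stand-in for running the LLM with ICL.
    """
    return [[icl_accuracy(src, tgt) for tgt in tasks] for src in tasks]


def normalize_per_target(T: Matrix) -> Matrix:
    """Shift each target column so its smallest entry is 0.

    Assumption: this removes per-target difficulty and keeps similarities
    nonnegative, which the greedy approximation guarantee relies on.
    """
    n = len(T)
    col_min = [min(T[i][j] for i in range(n)) for j in range(n)]
    return [[T[i][j] - col_min[j] for j in range(n)] for i in range(n)]


def greedy_facility_location(sim: Matrix, k: int) -> tuple[list[int], list[int]]:
    """Greedily maximize F(S) = sum_j max_{i in S} sim[i][j].

    Returns the selected task indices and, for every task j, the selected
    task that covers it best (its representative).
    """
    n = len(sim)
    selected: list[int] = []
    coverage = [0.0] * n  # max_{i in S} sim[i][j]; 0 for the empty set since sim >= 0
    for _ in range(min(k, n)):
        best_i, best_gain = -1, -1.0
        for i in range(n):
            if i in selected:
                continue
            # Marginal gain in F from adding task i to the current selection.
            gain = sum(max(sim[i][j] - coverage[j], 0.0) for j in range(n))
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
        coverage = [max(coverage[j], sim[best_i][j]) for j in range(n)]
    assignment = [max(selected, key=lambda i: sim[i][j]) for j in range(n)]
    return selected, assignment


if __name__ == "__main__":
    # Toy stand-in for real ICL runs: tasks in the same made-up cluster transfer well.
    cluster = {"algebra": 0, "geometry": 0, "history": 1, "law": 1}
    tasks = list(cluster)

    def toy_icl(src: str, tgt: str) -> float:
        return 0.9 if cluster[src] == cluster[tgt] else 0.5

    sim = normalize_per_target(transferability_matrix(tasks, toy_icl))
    selected, assignment = greedy_facility_location(sim, k=2)
    print("selected:", [tasks[i] for i in selected])
    print("represented by:", [tasks[i] for i in assignment])
```

Greedy selection is a natural fit here because facility location with nonnegative similarities is monotone submodular. The greedy solution is therefore within a factor of 1 − 1/e of the optimal subset of the same size. The `assignment` output can feed a weighted estimate of the full-benchmark score. In BenTo's setting, k would be about 5% of the original number of tasks.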
