VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation

Kun Qian; Shunji Wan; Claudia Tang; Youzhi Wang; Xuanming Zhang; Maximillian Chen; Zhou Yu

arXiv:2406.17681 · cs.CL · June 27, 2024 · 1 citation

TL;DR

This paper introduces VarBench, a dynamic benchmarking method that variabilizes test cases to provide fair, contamination-resistant evaluation of large language models across multiple datasets.

Contribution

It proposes a novel variable perturbation approach for benchmarking, enabling dynamic, fair, and contamination-resistant evaluation of language models.

Findings

01 Improved accuracy in assessing true model capabilities.

02 Effective mitigation of data contamination issues.

03 Versatile application across diverse datasets.

Abstract

As large language models achieve impressive scores on traditional benchmarks, an increasing number of researchers are becoming concerned about benchmark data leakage during pre-training, commonly known as the data contamination problem. To ensure fair evaluation, recent benchmarks release only the training and validation sets, keeping the test set labels closed-source. They require anyone wishing to evaluate their language model to submit the model's predictions for centralized processing, and then publish the model's result on their leaderboard. However, this submission process is inefficient and prevents effective error analysis. To address this issue, we propose to variabilize benchmarks and evaluate language models dynamically. Specifically, we extract variables from each test case and define a value range for each variable. For each evaluation, we sample new values from these value…
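The sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the VarBench repository: the template format, the names `TEMPLATE`, `sample_instance`, and `evaluate`, the example question, and the value ranges are all hypothetical and chosen only to show the idea of turning a fixed test case into a template whose variables are re-sampled on every evaluation.

```python
import random

# Hypothetical variabilized test case (illustrative only, not from the paper):
# the fixed numbers of an arithmetic word problem become named variables,
# each with an inclusive value range, and the gold answer is a function of them.
TEMPLATE = {
    "question": "Tom has {a} apples and buys {b} more. How many apples does he have now?",
    "ranges": {"a": (2, 50), "b": (2, 50)},
    "answer": lambda v: v["a"] + v["b"],
}

def sample_instance(template, rng):
    """Draw fresh values for every variable and build a concrete test case."""
    values = {
        name: rng.randint(lo, hi)  # randint is inclusive on both ends
        for name, (lo, hi) in template["ranges"].items()
    }
    return {
        "question": template["question"].format(**values),
        "answer": template["answer"](values),
        "values": values,
    }

def evaluate(model_fn, template, n_samples=5, seed=None):
    """Score a model on n_samples freshly sampled instances of one template.

    model_fn is any callable mapping a question string to an answer;
    here it stands in for a language model plus answer extraction.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        instance = sample_instance(template, rng)
        if model_fn(instance["question"]) == instance["answer"]:
            correct += 1
    return correct / n_samples
```

Because each run draws new values, a model that has memorized one published instance of the question gains nothing from it, while the locally computed gold answer keeps error analysis possible without a centralized leaderboard submission.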

Code & Models

Repositories

qbetterk/VarBench (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training