LLM Benchmark Datasets Should Be Contamination-Resistant
Ali Al-Lawati, Jason Lucas, Dongwon Lee, Suhang Wang

TL;DR
This paper emphasizes the importance of contamination-resistant benchmark datasets for LLM evaluation, proposing methods to make datasets unlearnable during training but usable during inference.
Contribution
It introduces the concept of contamination-resistant datasets, leveraging Transformer architecture asymmetries and mathematical methods to enhance benchmark reliability.
Findings
Contamination is prevalent in benchmark datasets, reducing their reliability.
Proposed properties and methods support contamination-resistance across LLM architectures.
The community is urged to adopt contamination-resistant benchmarks for more reliable evaluation.
Abstract
Benchmark datasets are critical for reproducible, reliable, and discriminative evaluation of LLMs. However, recent studies reveal that many benchmark datasets are included in pretraining corpora, i.e., contaminated, which diminishes their value as reliable measures of model generalization. In this paper, we argue that benchmark datasets should be contamination-resistant, i.e., unlearnable during training, but support inference. To accomplish this, we first highlight the wide prevalence of benchmark dataset contamination and outline the properties of contamination-resistant datasets. Second, we highlight how the asymmetry between the inference and training pipelines in the Transformer architecture can be leveraged to support contamination-resistance. Third, we outline mathematical advancements to make these datasets interoperable across various LLM…
