Addressing Data Leakage in HumanEval Using Combinatorial Test Design
Jeremy S. Bradbury, Riddhi More

TL;DR
This paper introduces a novel benchmark construction method using combinatorial test design to minimize data leakage in LLM evaluation, exemplified by creating HumanEval_T as an alternative to HumanEval.
Contribution
It proposes a new approach for constructing benchmarks with template tasks and combinatorial test design to reduce data leakage effects in LLM performance assessment.
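To make the idea of template tasks plus combinatorial test design concrete, the following is a minimal sketch, not the authors' implementation. It assumes a template task is a prompt with named placeholders and that task variants are chosen so every pair of placeholder values appears in at least one variant (2-way, or pairwise, coverage). The template text, the parameter names and values, and the helper functions `pairs_of` and `pairwise_suite` are all hypothetical and do not come from HumanEval_T.

```python
from itertools import combinations, product

# Hypothetical template task; the placeholders and their options are
# illustrative assumptions, not taken from HumanEval_T.
TEMPLATE = (
    "Write a function that returns the {agg} of the {selector} "
    "numbers in a {container}."
)
PARAMETERS = {
    "agg": ["sum", "product", "maximum"],
    "selector": ["even", "odd", "positive"],
    "container": ["list", "tuple"],
}


def pairs_of(row, names):
    """Return every pair of (parameter, value) settings present in one row."""
    return {((a, row[a]), (b, row[b])) for a, b in combinations(names, 2)}


def pairwise_suite(parameters):
    """Greedily pick full assignments until all value pairs are covered."""
    names = list(parameters)
    all_rows = [
        dict(zip(names, values)) for values in product(*parameters.values())
    ]
    # Every value pair that some full assignment can cover.
    uncovered = set()
    for row in all_rows:
        uncovered |= pairs_of(row, names)
    suite = []
    while uncovered:
        # Choose the row that covers the most still-uncovered pairs.
        best = max(all_rows, key=lambda r: len(pairs_of(r, names) & uncovered))
        suite.append(best)
        uncovered -= pairs_of(best, names)
    return suite


suite = pairwise_suite(PARAMETERS)
tasks = [TEMPLATE.format(**row) for row in suite]
for task in tasks:
    print(task)
```

The greedy selection here is a simple stand-in for a covering-array generator. It yields fewer variants than the 18 exhaustive combinations while still exercising every pair of values, which is the general trade-off combinatorial test design aims for.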
Findings
HumanEval_T reduces data leakage compared to HumanEval
Template-based benchmark construction improves fairness in evaluation
Method is applicable to other benchmark datasets
Abstract
The use of large language models (LLMs) is widespread across many domains, including Software Engineering, where they have been used to automate tasks such as program generation and test classification. As LLM-based methods continue to evolve, it is important that we define clear and robust methods that fairly evaluate performance. Benchmarks are a common way to assess how well LLMs solve problem-specific tasks and to compare how different versions of an LLM perform on those tasks over time. For example, the HumanEval benchmark is composed of 164 hand-crafted tasks and has become an important tool in assessing LLM-based program generation. However, a major barrier to a fair evaluation of LLMs using benchmarks like HumanEval is data contamination resulting from data leakage of benchmark tasks and solutions into the training data set. This barrier is compounded by the…
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Real-Time Systems Scheduling · Software Reliability and Analysis Research
