Addressing Data Leakage in HumanEval Using Combinatorial Test Design

Jeremy S. Bradbury; Riddhi More

arXiv:2412.01526·cs.SE·December 3, 2024

TL;DR

This paper introduces a novel benchmark construction method using combinatorial test design to minimize data leakage in LLM evaluation, exemplified by creating HumanEval_T as an alternative to HumanEval.

Contribution

The paper proposes a new approach to benchmark construction that uses template tasks and combinatorial test design to reduce the effect of data leakage on LLM performance assessment.
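To make the idea of template tasks plus combinatorial test design more concrete, here is a minimal sketch, not the authors' implementation. It assumes a task template with interchangeable slots and uses a simple greedy pairwise (2-way) covering strategy, which is one common form of combinatorial test design; this summary does not describe the paper's exact procedure. The template text, the slot names in `SLOTS`, and the helpers `pairs_of` and `pairwise_suite` are all hypothetical.

```python
from itertools import combinations, product

# Hypothetical template task: each slot has a few interchangeable values.
TEMPLATE = (
    "Write a function that returns the {agg} of the {selector} "
    "numbers in a {container}."
)
SLOTS = {
    "agg": ["sum", "product", "maximum"],
    "selector": ["even", "odd", "positive"],
    "container": ["list", "tuple"],
}


def pairs_of(row, names):
    """Return every pair of (slot, value) assignments that one row covers."""
    return {((a, row[a]), (b, row[b])) for a, b in combinations(names, 2)}


def pairwise_suite(slots):
    """Greedily pick rows until every pair of slot values appears at least once."""
    names = sorted(slots)
    # Exhaustive candidates; fine for small templates, an assumption of this sketch.
    candidates = [
        dict(zip(names, values))
        for values in product(*(slots[n] for n in names))
    ]
    uncovered = set().union(*(pairs_of(r, names) for r in candidates))
    suite = []
    while uncovered:
        # Choose the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda r: len(pairs_of(r, names) & uncovered))
        suite.append(best)
        uncovered -= pairs_of(best, names)
    return suite


suite = pairwise_suite(SLOTS)
tasks = [TEMPLATE.format(**row) for row in suite]
for task in tasks:
    print(task)
```

The exhaustive space here has 18 combinations, while a pairwise suite needs at least 9 rows because `agg` and `selector` alone form 9 value pairs. Each generated prompt is a concrete variant of the same template, so a benchmark built this way could be re-instantiated with different variants instead of reusing one fixed, possibly leaked, task.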

Findings

01. HumanEval_T reduces data leakage compared to HumanEval

02. Template-based benchmark construction improves fairness in evaluation

03. Method is applicable to other benchmark datasets

Abstract

The use of large language models (LLMs) is widespread across many domains, including Software Engineering, where they have been used to automate tasks such as program generation and test classification. As LLM-based methods continue to evolve, it is important that we define clear and robust methods that fairly evaluate performance. Benchmarks are a common way to assess how well LLMs solve problem-specific tasks and to compare different versions of an LLM over time. For example, the HumanEval benchmark is composed of 164 hand-crafted tasks and has become an important tool in assessing LLM-based program generation. However, a major barrier to a fair evaluation of LLMs using benchmarks like HumanEval is data contamination resulting from data leakage of benchmark tasks and solutions into the training data set. This barrier is compounded by the…

Taxonomy

Topics: Context-Aware Activity Recognition Systems · Real-Time Systems Scheduling · Software Reliability and Analysis Research