Benchmarking LLMs for Unit Test Generation from Real-World Functions

Dong Huang; Jie M. Zhang; Mark Harman; Qianru Zhang; Mingzhe Du; See-Kiong Ng

arXiv:2508.00408 · cs.SE · August 4, 2025


TL;DR

This paper introduces ULT, a new, challenging benchmark for evaluating how well large language models generate unit tests from real-world Python functions. It targets two limitations of previous benchmarks: data contamination and structurally simple function code.

Contribution

The paper presents ULT, a carefully curated, realistic benchmark with high complexity and minimal data contamination, enabling more accurate assessment of LLMs in unit test generation.
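To make "high complexity" concrete, the sketch below counts decision points in a Python function, in the spirit of cyclomatic complexity. It is an illustration only and not ULT's curation pipeline: the function name `approx_complexity`, the chosen node types, and the two sample functions are all assumptions introduced here.

```python
import ast

# Node types counted as decision points.
# This selection is an illustrative assumption, not ULT's definition.
_BRANCH_NODES = (
    ast.If,
    ast.For,
    ast.While,
    ast.ExceptHandler,
    ast.IfExp,
    ast.comprehension,
)


def approx_complexity(source: str) -> int:
    """Return 1 plus the number of decision points found in `source`."""
    tree = ast.parse(source)
    count = 1  # a straight-line function has a single path
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            count += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of `and` / `or` adds a short-circuit branch.
            count += len(node.values) - 1
    return count


# Hypothetical sample functions: one toy-like, one with several branches.
SIMPLE = "def add(a, b):\n    return a + b\n"

BRANCHY = '''
def classify(x, limit):
    if x is None or limit <= 0:
        return "invalid"
    for i in range(limit):
        if x == i:
            return "hit"
    return "miss"
'''

if __name__ == "__main__":
    print(approx_complexity(SIMPLE))
    print(approx_complexity(BRANCHY))
```

A function like `add` has one path and needs few tests, while `classify` needs tests for the invalid-input, hit, and miss cases. A benchmark made mostly of functions like the former would say little about test generation for real-world code.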

Findings

01 ULT is more challenging than existing benchmarks.

02 LLMs score lower on ULT than on existing test generation benchmarks.

03 PLT (PreLeakedTestbench), a companion to ULT with leaked tests, enables analysis of memorization versus reasoning.

Abstract

Recently, large language models (LLMs) have shown great promise in automating unit test generation, significantly reducing the manual effort required by developers. To effectively evaluate the capabilities of LLMs in this domain, it is crucial to have a well-designed benchmark that accurately reflects real-world scenarios and mitigates common pitfalls. Existing LLM test generation benchmarks are limited by two critical drawbacks: data contamination and structurally simple function code. As a result, we often cannot rely on the validity of scientific conclusions drawn from empirical studies using these limited benchmarks. The empirical evidence presented may be biased due to contamination and may fail to generalize beyond toy programs due to structural simplicity. To address these problems, we introduce ULT (UnLeakedTestbench), a new benchmark specifically designed for function-level…

Taxonomy

Topics: Software Testing and Debugging Techniques · Model-Driven Software Engineering Techniques · Topic Modeling