MultiFileTest: A Multi-File-Level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms

Yibo Wang; Congying Xia; Wenting Zhao; Jiangshu Du; Chunyu Miao; Zhongfen Deng; Philip S. Yu; Chen Xing

arXiv:2502.06556·cs.SE·April 8, 2026·2 cites


TL;DR

This paper introduces MultiFileTest, a multi-file-level benchmark for LLM unit test generation covering Python, Java, and JavaScript. Most frontier models tested achieve only moderate performance on it, and the paper also examines how error-fixing mechanisms affect results.

Contribution

The paper presents the first multi-file-level benchmark for LLM unit test generation, evaluates state-of-the-art models on it, and analyzes how error fixing affects their performance.

Findings

01

Most frontier LLMs show moderate performance on MultiFileTest.

02

Even advanced LLMs still make basic yet critical errors, including executability and cascade errors.

03

Error-fixing mechanisms improve LLMs' test generation capabilities.

Abstract

Unit test generation has become a promising and important Large Language Model (LLM) use case. However, existing evaluation benchmarks for LLM unit test generation focus on function- or class-level code (single-file) rather than on more practical and challenging multi-file-level codebases. To address this limitation, we propose MultiFileTest, a multi-file-level benchmark for unit test generation covering Python, Java, and JavaScript. MultiFileTest features 20 moderate-sized, high-quality projects per language. We evaluate eleven frontier LLMs on MultiFileTest, and the results show that most of them exhibit only moderate performance, highlighting the benchmark's difficulty. We also conduct a thorough error analysis, which shows that even advanced LLMs, such as Gemini-3.0-Pro, exhibit basic yet critical errors, including executability and cascade errors.…


Code & Models

Repositories

YiboWANG214/ProjectTest (GitHub)
