TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation

Steven Liu; Jane Luo; Xin Zhang; Aofan Liu; Hao Liu; Jie Wu; Ziyang Huang; Yangyu Huang; Yu Kang; Scarlett Li

arXiv:2602.10471 · cs.SE · February 24, 2026

Open Access

TL;DR

TestExplora introduces a benchmark for evaluating large language models as proactive software testers, revealing current models' limited bug discovery capabilities and emphasizing the importance of agentic exploration for autonomous quality assurance.

Contribution

We present TestExplora, a novel benchmark with realistic, repository-level tasks that evaluates LLMs' ability for proactive bug discovery using documentation as an oracle.

Findings

01. State-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of 16.06%.

02. Agentic exploration with GPT-5-mini improves F2P to 17.27% and F2P@5 to 29.7%.

03. Navigating cross-module interactions is key to enhancing bug discovery.
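The page does not define the metrics it reports, so the following is only a minimal sketch under stated assumptions: a generated test counts as Fail-to-Pass when it fails on the defective code and passes on the repaired code, and F2P@k counts a task as solved when at least one of its first k generated tests is Fail-to-Pass. The names `Outcome`, `f2p_rate`, and `f2p_at_k` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Outcome:
    """Result of running one generated test (hypothetical record)."""
    passes_on_buggy: bool  # did the test pass on the defective code?
    passes_on_fixed: bool  # did the test pass on the repaired code?


def is_fail_to_pass(outcome: Outcome) -> bool:
    # Assumed definition: the test exposes the bug (fails before the fix)
    # and is itself valid (passes after the fix).
    return (not outcome.passes_on_buggy) and outcome.passes_on_fixed


def f2p_rate(outcomes: List[Outcome]) -> float:
    """Fraction of single attempts (one per task) that are Fail-to-Pass."""
    if not outcomes:
        return 0.0
    return sum(is_fail_to_pass(o) for o in outcomes) / len(outcomes)


def f2p_at_k(attempts_per_task: Dict[str, List[Outcome]], k: int) -> float:
    """Fraction of tasks where any of the first k attempts is Fail-to-Pass."""
    if not attempts_per_task:
        return 0.0
    solved = sum(
        any(is_fail_to_pass(o) for o in attempts[:k])
        for attempts in attempts_per_task.values()
    )
    return solved / len(attempts_per_task)
```

Under these assumptions a test that passes on both versions, or fails on both, does not count, which is why F2P@5 can be substantially higher than single-attempt F2P.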

Abstract

Given that Large Language Models (LLMs) are increasingly applied to automate software development, comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. Specifically, they either treat existing code as ground truth (a compliance trap) for regression prevention, or depend on post-failure artifacts (e.g., issue reports) for bug reproduction, so they rarely surface defects before failures. To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments. TestExplora contains 2,389 tasks from 482 repositories and hides all defect-related signals. Models must proactively find bugs by comparing implementations against documentation-derived intent, using documentation…

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Engineering Techniques and Practices