TestExplora: Benchmarking LLMs for Proactive Bug Discovery via Repository-Level Test Generation
Steven Liu, Jane Luo, Xin Zhang, Aofan Liu, Hao Liu, Jie Wu, Ziyang Huang, Yangyu Huang, Yu Kang, Scarlett Li

TL;DR
TestExplora introduces a benchmark for evaluating large language models as proactive software testers, revealing current models' limited bug discovery capabilities and emphasizing the importance of agentic exploration for autonomous quality assurance.
Contribution
We present TestExplora, a novel benchmark of realistic, repository-level tasks that evaluates LLMs' ability to discover bugs proactively, using documentation as the oracle.
Findings
State-of-the-art models achieve a maximum Fail-to-Pass (F2P) rate of 16.06%.
Agentic exploration with GPT-5-mini improves F2P to 17.27% and F2P@5 to 29.7%.
Navigating cross-module interactions is key to enhancing bug discovery.
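
The F2P and F2P@5 figures above can be read with a small sketch. This summary does not define either metric, so the following rests on stated assumptions: a generated test counts as Fail-to-Pass when it fails on the defective revision and passes on the repaired one (the usual meaning of the term), and F2P@k counts a task as solved when any of its first k attempts is Fail-to-Pass. The names Outcome, f2p_rate, and f2p_at_k are hypothetical and are not taken from the benchmark's code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Outcome:
    """Result of running one generated test (hypothetical record layout)."""
    passes_on_buggy: bool  # did the test pass on the defective revision?
    passes_on_fixed: bool  # did the test pass on the repaired revision?


def is_fail_to_pass(outcome: Outcome) -> bool:
    # Assumed definition: the test exposes the bug (fails before the fix)
    # and is otherwise valid (passes after the fix).
    return (not outcome.passes_on_buggy) and outcome.passes_on_fixed


def f2p_at_k(attempts_per_task: Dict[str, List[Outcome]], k: int) -> float:
    """Fraction of tasks where any of the first k attempts is Fail-to-Pass."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts_per_task:
        raise ValueError("at least one task is required")
    solved = sum(
        any(is_fail_to_pass(o) for o in attempts[:k])
        for attempts in attempts_per_task.values()
    )
    return solved / len(attempts_per_task)


def f2p_rate(attempts_per_task: Dict[str, List[Outcome]]) -> float:
    """Single-attempt rate: F2P@k with k = 1 under the assumptions above."""
    return f2p_at_k(attempts_per_task, 1)


# Toy data, invented purely for illustration.
example = {
    "task-a": [Outcome(True, True), Outcome(False, True)],  # second attempt finds the bug
    "task-b": [Outcome(False, False)],                       # broken test, never counts
}
```

On the toy data, the single-attempt rate is zero because the first attempt for task-a passes on both revisions, while allowing two attempts solves one of the two tasks. A test that fails on both revisions is treated as invalid rather than as a discovery, which is why task-b never counts.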
Abstract
Given that Large Language Models (LLMs) are increasingly applied to automate software development, comprehensive software assurance spans three distinct goals: regression prevention, reactive reproduction, and proactive discovery. Current evaluations systematically overlook the third goal. Specifically, they either treat existing code as ground truth (a compliance trap) for regression prevention, or depend on post-failure artifacts (e.g., issue reports) for bug reproduction, so they rarely surface defects before failures occur. To bridge this gap, we present TestExplora, a benchmark designed to evaluate LLMs as proactive testers within full-scale, realistic repository environments. TestExplora contains 2,389 tasks from 482 repositories and hides all defect-related signals. Models must proactively find bugs by comparing implementations against documentation-derived intent, using documentation…
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software Engineering Techniques and Practices
