Tests4Py: A Benchmark for System Testing
Marius Smytzek, Martin Eberlein, Batuhan Serce, Lars Grunske, and Andreas Zeller

TL;DR
Tests4Py is a benchmark for system testing in Python. It features 79 bugs (73 from seven real-world applications and six from example programs), each with a functional correctness oracle, and supports both system and unit test generation for research in test automation.
Contribution
The paper introduces a new benchmark, derived from BugsInPy, that adds functional correctness oracles and support for system and unit test generation, enabling research in test generation, debugging, and automatic program repair.
Findings
Includes 73 bugs from seven real-world Python applications and six bugs from example programs
Supports both system and unit test generation
Enables qualitative studies and extensive evaluations in test generation, debugging, and automatic program repair
Abstract
Benchmarks are among the main drivers of progress in software engineering research. However, many current benchmarks are limited by inadequate system oracles and sparse unit tests. Our Tests4Py benchmark, derived from the BugsInPy benchmark, addresses these limitations. It includes 73 bugs from seven real-world Python applications and six bugs from example programs. Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation. This allows for comprehensive qualitative studies and extensive evaluations, making Tests4Py a cutting-edge benchmark for research in test generation, debugging, and automatic program repair.
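To make the idea of a functional correctness oracle for system tests concrete, here is a minimal sketch. It is not the Tests4Py API: the subject `buggy_middle`, the helpers `run_system_test`, `oracle`, and `evaluate`, and the `Verdict` enum are all hypothetical names chosen for illustration. The sketch assumes a system test is a list of string arguments and that the oracle labels each run as passing or failing by comparing the program's textual output with an independently computed expected value.

```python
from enum import Enum
from typing import List, Tuple


class Verdict(Enum):
    """Hypothetical oracle verdicts for a single test run."""
    PASSING = "passing"
    FAILING = "failing"


def buggy_middle(x: int, y: int, z: int) -> int:
    """Hypothetical subject: should return the median of three integers."""
    if y < z:
        if x < y:
            return y
        if x < z:
            return y  # seeded bug: the median here is x, not y
    else:
        if x > y:
            return y
        if x > z:
            return x
    return z


def run_system_test(args: List[str]) -> str:
    """Run the subject at system level: string arguments in, text out."""
    x, y, z = (int(a) for a in args)
    return str(buggy_middle(x, y, z))


def oracle(args: List[str], output: str) -> Verdict:
    """Judge functional correctness against an independent reference."""
    expected = sorted(int(a) for a in args)[1]
    return Verdict.PASSING if output == str(expected) else Verdict.FAILING


def evaluate(tests: List[List[str]]) -> List[Tuple[List[str], Verdict]]:
    """Label every system test with the oracle's verdict."""
    return [(t, oracle(t, run_system_test(t))) for t in tests]


if __name__ == "__main__":
    for test, verdict in evaluate([["1", "2", "3"], ["2", "1", "3"]]):
        print(test, verdict.value)
```

Because the oracle judges arbitrary inputs rather than a fixed set of hand-written assertions, a test generator can produce new inputs and still obtain a pass or fail label for each one.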
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software System Performance and Reliability
