PBT-Bench: Benchmarking AI Agents on Property-Based Testing
Lucas Jing, Xinqi Wang, Liao Zhang, Simon S. Du

TL;DR
PBT-Bench is a benchmark of 100 property-based testing problems that evaluates how well AI agents derive invariants and generate precise inputs, and it exposes differences in capability across models.
Contribution
This paper introduces PBT-Bench, a benchmark for property-based testing that evaluates AI agents' semantic reasoning and input-generation skills across 40 real Python libraries.
Findings
Bug recall ranges from 42.1% to 83.4% with structured prompts.
Hypothesis scaffolding improves performance for mid-capability models.
Different architectures fail on different problems, indicating persistent gaps.
Abstract
Existing code benchmarks measure whether an agent can produce any test that reproduces a known bug, or whether it can produce a patch that fixes a described issue. Neither isolates the distinct skill of property-based testing: deriving a semantic invariant from documentation, and then constructing an input-generation strategy precise enough to make a random search reveal the violation. We introduce PBT-Bench, a benchmark of 100 curated property-based testing problems across 40 real Python libraries. Each problem injects one or more semantic bugs (365 in total, mean 3.65 per problem) designed so that default-strategy random inputs almost never trigger them; the agent must read the library's documentation, identify the relevant invariant, and specify a Hypothesis @given strategy that concentrates mass in the trigger region. Bugs are stratified across three difficulty levels…
