TL;DR
This paper introduces the Precise Debugging Benchmark (PDB) framework to evaluate how precisely large language models debug code, and finds that current models often regenerate whole solutions rather than apply targeted fixes.
Contribution
The paper contributes a benchmarking framework, two new metrics (edit-level precision and bug-level recall), and two benchmarks (PDB-Single-Hard and PDB-Multi) for evaluating how precisely large language models debug.
Findings
Frontier models such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking achieve unit-test pass rates above 76% but precision below 45%.
Iterative and agentic debugging strategies do not significantly improve performance.
The framework automatically converts coding datasets into debugging benchmarks with precision-aware evaluation.
Abstract
Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure how many necessary edits are made and how many bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard on single-line bugs, and PDB-Multi on multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass…
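The two metrics can be illustrated with a small sketch. This is not the paper's implementation: the names `AtomicBug`, `compose_bugs`, and `score_patch` are hypothetical, bugs are assumed to be single-line substitutions, edits are counted per line, and a bug is treated as resolved when its line matches the reference again (a real evaluation would more likely rely on unit tests). Under those assumptions, precision is the share of edited lines that were actually buggy, and recall is the share of injected bugs that were fixed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicBug:
    """Hypothetical single-line bug: replace one line of the reference."""
    line: int   # 0-based index of the line to corrupt
    buggy: str  # replacement text that introduces the fault


def compose_bugs(reference: list[str], bugs: list[AtomicBug]) -> list[str]:
    """Apply several atomic bugs to a copy of the reference program."""
    program = list(reference)
    for bug in bugs:
        program[bug.line] = bug.buggy
    return program


def edited_lines(before: list[str], after: list[str]) -> set[int]:
    """Indices of lines that differ (assumes both have the same length)."""
    return {i for i, (a, b) in enumerate(zip(before, after)) if a != b}


def score_patch(reference, buggy, patched, bugs) -> tuple[float, float]:
    """Return (edit-level precision, bug-level recall) under the
    simplifying assumptions stated in the lead-in."""
    edits = edited_lines(buggy, patched)
    bug_lines = {b.line for b in bugs}
    # Precision: fraction of the model's edits that touch a buggy line.
    precision = len(edits & bug_lines) / len(edits) if edits else 0.0
    # Recall: fraction of injected bugs whose line matches the reference again.
    resolved = [b for b in bugs if patched[b.line] == reference[b.line]]
    recall = len(resolved) / len(bugs) if bugs else 0.0
    return precision, recall


reference = [
    "def clamp(x, lo, hi):",
    "    if x < lo:",
    "        return lo",
    "    if x > hi:",
    "        return hi",
    "    return x",
]
bugs = [AtomicBug(2, "        return hi"), AtomicBug(5, "    return lo")]
buggy = compose_bugs(reference, bugs)

# A targeted fix restores only the two buggy lines.
targeted = list(reference)

# An over-edited fix also rewrites line 3 ("if" -> "elif"), which was not buggy.
over_edited = list(reference)
over_edited[3] = "    elif x > hi:"

print(score_patch(reference, buggy, targeted, bugs))
print(score_patch(reference, buggy, over_edited, bugs))
```

Both patches behave correctly and would pass the same unit tests, but the over-edited one touches three lines where only two needed changing, so its precision drops to 2/3 while its recall stays at 1.0. That gap between passing tests and editing precisely is what the benchmark is designed to expose.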
