Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

Wang Bill Zhu; Miaosen Chai; Shangshang Wang; Yejia Liu; Song Bian; Honghua Dong; Willie Neiswanger; Robin Jia

arXiv:2604.17338·cs.SE·May 19, 2026

TL;DR

This paper introduces the Precise Debugging Benchmark (PDB) framework, which evaluates how precisely large language models debug code. It finds that current models often regenerate a solution rather than apply a targeted fix to the bug.

Contribution

The paper presents a benchmarking framework, two new metrics (edit-level precision and bug-level recall), and two benchmarks (PDB-Single-Hard and PDB-Multi) for evaluating how precisely large language models debug.

Findings

01

Frontier models achieve unit-test pass rates above 76% but precision below 45%.

02

Iterative and agentic debugging strategies do not significantly improve performance.

03

The framework automatically converts coding datasets into debugging benchmarks with precision-aware evaluation.

Abstract

Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure, respectively, how many necessary edits are made and how many bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard on single-line bugs, and PDB-Multi on multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass…
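
To make the two metrics concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary above does not give the exact definitions, so this assumes a line-level approximation in which precision is the share of edited lines that actually contained a bug, and recall is the share of injected bugs that were resolved. The names `changed_lines`, `edit_precision`, and `bug_recall`, and the toy programs, are all hypothetical.

```python
from difflib import SequenceMatcher


def changed_lines(buggy: str, fixed: str) -> set[int]:
    """0-based indices of lines in `buggy` that the fix replaced or deleted.

    Assumption: edits are counted per original line. Pure insertions are
    attributed to the original line index at which they occur.
    """
    a, b = buggy.splitlines(), fixed.splitlines()
    changed: set[int] = set()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            changed.update(range(i1, i2))
        elif tag == "insert":
            changed.add(i1)
    return changed


def edit_precision(edited: set[int], necessary: set[int]) -> float:
    """Share of edited lines that were necessary (assumed definition)."""
    if not edited:
        return 0.0
    return len(edited & necessary) / len(edited)


def bug_recall(resolved: set[str], injected: set[str]) -> float:
    """Share of injected bugs that the fix resolved (assumed definition)."""
    if not injected:
        return 1.0
    return len(resolved & injected) / len(injected)


# Toy example: one injected bug on line index 1 (`s = 1` should be `s = 0`).
BUGGY = "def total(xs):\n    s = 1\n    for x in xs:\n        s += x\n    return s\n"
PRECISE_FIX = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
REGENERATED = "def total(xs):\n    return sum(xs)\n"
BUG_LINES = {1}

precise_score = edit_precision(changed_lines(BUGGY, PRECISE_FIX), BUG_LINES)
regenerated_score = edit_precision(changed_lines(BUGGY, REGENERATED), BUG_LINES)
```

Both fixes would pass the same unit tests, yet under this sketch the targeted fix scores 1.0 while the regenerated one scores 0.25, because it rewrites four original lines to repair one. That gap between pass rate and precision is the behaviour the benchmark is designed to expose.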

Code & Models

Repositories

bill1235813/PDB (GitHub)