Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair
Noor Nashid, Daniel Ding, Keheliya Gallaba, Ahmed E. Hassan, Ali Mesbah

TL;DR
This study systematically evaluates large language model-driven coding agents on multi-hunk bug repair, revealing their strengths, limitations, and the impact of context-aware tools like Maple on repair accuracy.
Contribution
First comprehensive analysis of LLM-driven agents on multi-hunk bug repair, introducing fine-grained metrics and the Maple tool for improved localization and repair performance.
Findings
Repair accuracy varies substantially among agents, from 25.8% (Qwen Code) to 93.3% (Claude Code).
Repair accuracy declines as bug dispersion and complexity increase.
Maple improves Gemini-cli's accuracy by 30%.
Abstract
Automated program repair has traditionally focused on single-hunk defects, overlooking multi-hunk bugs that are prevalent in real-world systems. Repairing these bugs requires coordinated edits across multiple, disjoint code regions, posing substantially greater challenges. We present the first systematic study of LLM-driven coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on this task. We evaluate these agents on 372 multi-hunk bugs from the Hunk4J dataset, analyzing 1,488 repair trajectories using fine-grained metrics that capture localization, repair accuracy, regression behavior, and operational dynamics. Results reveal substantial variation: repair accuracy ranges from 25.8% (Qwen Code) to 93.3% (Claude Code) and consistently declines with increasing bug dispersion and complexity. High-performing agents demonstrate superior semantic consistency, achieving positive…
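To make "multi-hunk" concrete: in a unified diff, each hunk begins with an `@@ -a,b +c,d @@` header, so a patch with more than one header edits several disjoint regions. The sketch below counts hunks and touched files in a patch. It is only an illustration, not the paper's tooling or its dispersion metric; the function name `summarize_patch` and the sample diff are hypothetical.

```python
import re

# Unified-diff hunk header, e.g. "@@ -10,3 +10,4 @@" (the line counts are optional).
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@")


def summarize_patch(diff_text):
    """Count hunks and touched files in a unified diff (hypothetical helper).

    Simplification: any line starting with "+++ " is treated as a file header,
    so an added source line that itself begins with "++ " would be miscounted.
    """
    files = set()
    hunks = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # "+++ b/src/Foo.java" -> "src/Foo.java"
            path = line[4:].split("\t")[0].strip()
            if path.startswith("b/"):
                path = path[2:]
            files.add(path)
        elif HUNK_HEADER.match(line):
            hunks += 1
    return {"hunks": hunks, "files": len(files), "multi_hunk": hunks > 1}


# Made-up patch: two hunks in one file, one hunk in another.
EXAMPLE = """\
--- a/src/Foo.java
+++ b/src/Foo.java
@@ -10,3 +10,4 @@
 int a = 1;
+int b = 2;
 return a;
 }
@@ -40,2 +41,2 @@
-return x;
+return y;
 }
--- a/src/Bar.java
+++ b/src/Bar.java
@@ -5 +5 @@
-foo();
+bar();
"""

print(summarize_patch(EXAMPLE))
```

A patch like this one, with three hunks spread over two files, is the kind of coordinated edit the study targets. How the authors quantify dispersion is not stated in the abstract, so the counts here are only a rough proxy.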
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Security and Verification in Computing
