Evaluating Agent-based Program Repair at Google
Pat Rondon, Renyao Wei, José Cambronero, Jürgen Cito, Aaron Sun, Siddhant Sanyam, Michele Tufano, Satish Chandra

TL;DR
This paper evaluates agent-based program repair in an industrial setting at Google and establishes a performance baseline on a new, diverse dataset of bugs drawn from Google's issue tracking system.
Contribution
It introduces a new evaluation dataset from Google and demonstrates the performance of an agentic repair approach, Passerine, in an enterprise context.
Findings
Passerine produces a plausible patch for 73% of machine-reported bugs.
Passerine produces a plausible patch for 25.6% of human-reported bugs.
For approximately 43% of machine-reported bugs, a generated patch is semantically equivalent to the ground-truth patch.
Abstract
Agent-based program repair offers to automatically resolve complex bugs end-to-end by combining the planning, tool use, and code generation abilities of modern LLMs. Recent work has explored the use of agent-based repair approaches on the popular open-source SWE-Bench, a collection of bugs from highly-rated GitHub Python projects. In addition, various agentic approaches such as SWE-Agent have been proposed to solve bugs in this benchmark. This paper explores the viability of using an agentic approach to address bugs in an enterprise context. To investigate this, we curate an evaluation set of 178 bugs drawn from Google's issue tracking system. This dataset spans both human-reported (78) and machine-reported bugs (100). To establish a repair performance baseline on this benchmark, we implement Passerine, an agent similar in spirit to SWE-Agent that can work within Google's development…
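To put the reported rates in terms of bug counts, here is a minimal arithmetic sketch. It assumes each percentage in the findings is taken over the full subset named in the abstract (100 machine-reported bugs, 78 human-reported bugs). The helper `implied_count` and the combined rate are illustrative derivations, not figures stated in the source.

```python
def implied_count(rate_percent: float, subset_size: int) -> int:
    """Convert a reported percentage into the nearest whole number of bugs.

    Assumption: the percentage is computed over the entire subset.
    """
    return round(rate_percent / 100 * subset_size)


# Subset sizes as given in the abstract.
MACHINE_TOTAL = 100
HUMAN_TOTAL = 78

# Rates as given in the findings.
machine_plausible = implied_count(73.0, MACHINE_TOTAL)
human_plausible = implied_count(25.6, HUMAN_TOTAL)
machine_equivalent = implied_count(43.0, MACHINE_TOTAL)

# Derived here, not reported in the source: plausible-patch rate over all 178 bugs.
overall_plausible_rate = (machine_plausible + human_plausible) / (
    MACHINE_TOTAL + HUMAN_TOTAL
)

print(f"machine-reported, plausible:  {machine_plausible}/{MACHINE_TOTAL}")
print(f"human-reported, plausible:    {human_plausible}/{HUMAN_TOTAL}")
print(f"machine-reported, equivalent: {machine_equivalent}/{MACHINE_TOTAL}")
print(f"combined plausible rate:      {overall_plausible_rate:.1%}")
```

Under that assumption, 25.6% of 78 corresponds to about 20 human-reported bugs, which makes the gap between the two subsets easier to see than the percentages alone.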
