Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback

Chandra Maddila; Adam Tait; Claire Chang; Daniel Cheng; Nauman Ahmad; Vijayaraghavan Murali; Marshall Roch; Arnaud Avondet; Aaron Meltzer; Victor Montalvao; Michael Hopko; Chris Waterson; Parth Thakkar; Renuka Fernandez; Kristian Kristensen; Sivan Barzily; Sherry Chen; Rui Abreu; Nachiappan Nagappan; Payam Shodjai; Killian Murphy; James Everingham; Aparna Ramani; Peter C. Rigby

arXiv:2507.18755·cs.SE·July 28, 2025

TL;DR

This paper presents a neuro-symbolic, agent-based approach to large-scale program repair that combines LLMs, static analysis, and test-execution feedback, reporting a 42.3% solve rate on its offline benchmarks.

Contribution

It introduces an engineering agent that combines LLMs, static analysis, and symbolic reasoning for automated program repair at scale, with a novel feedback loop and evaluation benchmarks.

Findings

01. Specialized 70B model performs competitively with larger models.

02. ReAct harness benefits from static analysis and test traces.

03. 42.3% solve rate with an average of 11.8 feedback iterations on benchmarks.

Abstract

Aim: With the advent of LLMs, sophisticated agentic program repair has become viable at large organizations with large codebases. In this work, we develop an Engineering Agent that fixes source code based on test failures at scale across diverse internal software offerings.

Method: Using Llama as the base model, we employ the ReAct harness to develop an agent. We start with a test failure that was triaged by a rule-based test failure bot. We then set up an agentic harness and allow the agent to reason and run a set of 15 actions, from reading a file to generating a patch. We provide feedback to the agent through static analysis and test failures so it can refine its solution. We leverage an LLM-as-a-Judge to ensure that the patch conforms to the standards, followed by a human review to land fixes.

Benchmark Findings: We curated offline benchmarks for our patch generator, the…
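The feedback loop described in the Method can be illustrated with a minimal Python sketch. This is not the authors' implementation: `repair_loop`, `Feedback`, and the callables `propose_patch`, `run_static_analysis`, and `run_tests` are hypothetical names introduced here, and the toy functions stand in for the LLM and the real tooling. The sketch covers only the propose, check, and refine cycle; it omits the 15-action ReAct tool set, the LLM-as-a-Judge step, and human review.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Feedback:
    """Symbolic feedback handed back to the agent after it proposes a patch."""
    lint_errors: list[str] = field(default_factory=list)
    test_failures: list[str] = field(default_factory=list)

    @property
    def clean(self) -> bool:
        # A patch is accepted only when both checkers report nothing.
        return not self.lint_errors and not self.test_failures


def repair_loop(
    failure: str,
    propose_patch: Callable[[str, list[str]], str],   # hypothetical LLM call
    run_static_analysis: Callable[[str], list[str]],  # hypothetical linter hook
    run_tests: Callable[[str], list[str]],            # hypothetical test runner hook
    max_iterations: int = 5,                          # assumed budget, not from the paper
) -> tuple[Optional[str], int]:
    """Return (patch, iterations_used); patch is None if the budget runs out."""
    observations: list[str] = []
    for iteration in range(1, max_iterations + 1):
        patch = propose_patch(failure, observations)
        feedback = Feedback(run_static_analysis(patch), run_tests(patch))
        if feedback.clean:
            return patch, iteration
        # Feed the diagnostics back so the next proposal can react to them.
        observations.extend(feedback.lint_errors + feedback.test_failures)
    return None, max_iterations


# Toy stand-ins: the "model" fixes its bug once it has seen any feedback.
def toy_propose(failure: str, observations: list[str]) -> str:
    return "return a + b" if observations else "return a - b"


def toy_static(patch: str) -> list[str]:
    return []


def toy_tests(patch: str) -> list[str]:
    return [] if "a + b" in patch else ["test_add: expected 3, got -1"]


patch, used = repair_loop("test_add failed", toy_propose, toy_static, toy_tests)
print(patch, used)
```

In this toy run the first proposal fails the test, the failure message is appended to the observations, and the second proposal passes, so the loop returns after two iterations. A real harness would also need to cap context growth and distinguish flaky tests from genuine failures, neither of which is modeled here.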

Taxonomy

Topics: Software Testing and Debugging Techniques · Software Engineering Research · Software Reliability and Analysis Research