TL;DR
AgentForge is an execution-grounded multi-agent framework for autonomous software engineering in which every code change must pass sandboxed execution before it propagates.
Contribution
This work formalizes execution-grounded verification as a core principle and demonstrates its effectiveness in a multi-agent LLM system for software development.
Findings
AgentForge reaches a 40.0% resolution rate on SWE-BENCH Lite, 26-28 percentage points above single-agent baselines.
Execution feedback and role decomposition independently improve system performance.
Open-source implementation available at https://github.com/raja21068/AutoCodeAI.
Abstract
Large language models generate plausible code but cannot verify correctness. Existing multi-agent systems simulate execution or leave verification optional. We introduce execution-grounded verification as a first-class principle: every code change must survive sandboxed execution before propagation. We instantiate this principle in AGENTFORGE, a multi-agent framework where Planner, Coder, Tester, Debugger, and Critic agents coordinate through shared memory and a mandatory Docker sandbox. We formalize software engineering with LLMs as an iterative decision process over repository states, where execution feedback provides a stronger supervision signal than next-token likelihood. AGENTFORGE achieves 40.0% resolution on SWE-BENCH Lite, outperforming single-agent baselines by 26-28 points. Ablations confirm that execution feedback and role decomposition each independently drive…
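The principle stated in the abstract (a change propagates only after it survives execution, and execution output feeds the next attempt) can be illustrated with a minimal Python sketch. This is not the AgentForge implementation: `RepoState`, `run_in_sandbox`, `propose_and_verify`, and the `propose` callback are hypothetical names introduced here, and a temporary directory plus a subprocess stands in for the Docker sandbox, which provides no real isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Dict, Tuple

# Hypothetical representation of a repository state: relative path -> file contents.
RepoState = Dict[str, str]


def run_in_sandbox(state: RepoState, test_file: str, timeout: float = 30.0) -> Tuple[bool, str]:
    """Materialize the candidate state in a throwaway directory and run one test script.

    Assumption: a temp dir and a subprocess stand in for the Docker sandbox.
    Returns (passed, combined stdout and stderr).
    """
    with tempfile.TemporaryDirectory() as tmp:
        for rel_path, content in state.items():
            target = Path(tmp) / rel_path
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
        try:
            proc = subprocess.run(
                [sys.executable, test_file],
                cwd=tmp,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
        return proc.returncode == 0, proc.stdout + proc.stderr


def propose_and_verify(
    state: RepoState,
    propose: Callable[[RepoState, str], RepoState],
    test_file: str,
    max_iters: int = 3,
    sandbox: Callable[[RepoState, str], Tuple[bool, str]] = run_in_sandbox,
) -> Tuple[RepoState, bool]:
    """Iterate over candidate states; accept a change only if it passes execution."""
    feedback = ""
    for _ in range(max_iters):
        # `propose` plays the role of the code-writing agents (hypothetical interface):
        # it sees the last accepted state and the latest execution feedback.
        patch = propose(state, feedback)
        candidate = {**state, **patch}
        passed, feedback = sandbox(candidate, test_file)
        if passed:
            return candidate, True  # verified change propagates
    return state, False  # unverified changes are discarded
```

In this sketch a failed candidate never replaces the accepted state; only its execution output is carried forward as feedback for the next proposal. The `sandbox` parameter is injectable so that a container-backed runner could be substituted for the subprocess stand-in.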
