CodeMonkeys: Scaling Test-Time Compute for Software Engineering
Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, and Azalia Mirhoseini

TL;DR
CodeMonkeys is a system that enhances large language models' ability to solve real-world software issues by iteratively editing codebases through combined serial and parallel test-time compute, improving success rates on GitHub issues.
Contribution
We introduce CodeMonkeys, a system that scales test-time compute for software engineering tasks along two axes: iterative editing within a trajectory (serial compute) and sampling many trajectories per issue (parallel compute).
Findings
Resolved 57.4% of issues from SWE-bench.
Reached a 66.2% resolve rate when candidate selection was applied to an ensemble of edits.
Demonstrated effective parallel and serial scaling of test-time compute.
Abstract
Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to…
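The two scaling axes described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every name here (`Trajectory`, `run_trajectory`, `sample_candidates`, `make_toy_model`, `toy_test`) is hypothetical, and the "model" and "test script" are toy stand-ins in which an edit is an integer and the test passes only when it equals 5. The point is only the control flow: `max_iterations` bounds serial compute per trajectory, and `num_trajectories` sets parallel compute per problem.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Trajectory:
    edit: int          # final draft edit (toy stand-in: an integer)
    passed: bool       # whether the trajectory's own test passed
    iterations: int    # serial steps actually used


def run_trajectory(
    propose: Callable[[Optional[int], Optional[str]], int],
    run_test: Callable[[int], Tuple[bool, str]],
    max_iterations: int,
) -> Trajectory:
    """One multi-turn trajectory: draft an edit, run the test, revise on feedback."""
    edit: Optional[int] = None
    feedback: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        edit = propose(edit, feedback)
        passed, feedback = run_test(edit)
        if passed:
            return Trajectory(edit, True, iteration)
    return Trajectory(edit, False, max_iterations)


def sample_candidates(
    propose: Callable[[Optional[int], Optional[str]], int],
    run_test: Callable[[int], Tuple[bool, str]],
    num_trajectories: int,
    max_iterations: int,
) -> List[Trajectory]:
    """Sample independent trajectories; each one contributes a candidate edit."""
    return [
        run_trajectory(propose, run_test, max_iterations)
        for _ in range(num_trajectories)
    ]


def make_toy_model(rng: random.Random):
    """Hypothetical stand-in for an LLM: random first draft, then follows feedback."""
    def propose(previous: Optional[int], feedback: Optional[str]) -> int:
        if previous is None:
            return rng.randint(0, 9)
        return previous + 1 if feedback == "too low" else previous - 1
    return propose


def toy_test(edit: int) -> Tuple[bool, str]:
    """Hypothetical stand-in for a testing script; passes only when edit == 5."""
    if edit == 5:
        return True, "ok"
    return False, ("too low" if edit < 5 else "too high")


if __name__ == "__main__":
    model = make_toy_model(random.Random(0))
    for candidate in sample_candidates(model, toy_test, num_trajectories=4, max_iterations=6):
        print(candidate)
```

In this toy setting, raising `max_iterations` gives each trajectory more chances to converge, while raising `num_trajectories` produces more candidate edits to choose from. How the real system drafts edits, writes tests, and picks among candidates is not covered by this sketch.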
Taxonomy
Topics: Software Testing and Debugging Techniques · Scientific Computing and Data Management · Embedded Systems Design Techniques
