REDO: Execution-Free Runtime Error Detection for COding Agents
Shou Li, Andrey Kan, Laurent Callot, Bhavana Bhasker, Muhammad Shihab Rashid, Timothy B Esler

TL;DR
REDO combines LLMs with static analysis tools to detect runtime errors in code without executing it, improving error detection accuracy for coding agents.
Contribution
This work presents REDO, a method that integrates LLMs with static analysis tools for runtime error detection, and introduces SWEDE, a benchmark for evaluating the detection of such errors in complex code repositories.
Findings
REDO achieves 11.0% higher accuracy than state-of-the-art methods.
REDO attains a 9.1% higher weighted F1 score than state-of-the-art methods.
The approach provides better error detection insights for coding agents.
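For readers unfamiliar with the second metric, here is a minimal sketch of how a weighted F1 score is commonly computed: F1 is calculated per class and then averaged with weights proportional to each class's support. The labels `safe` and `unsafe` and the toy predictions are assumptions for illustration only, not data or class names from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        score += (n / total) * f1  # weight each class by its share of the data
    return score


# Hypothetical labels: "safe" = no runtime error, "unsafe" = runtime error.
y_true = ["safe", "safe", "safe", "unsafe"]
y_pred = ["safe", "safe", "unsafe", "unsafe"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                              # 0.75
print(round(weighted_f1(y_true, y_pred), 4)) # 0.7667
```

Weighting by support keeps a rare class from dominating the average, which matters when the classes are unevenly represented.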
Abstract
As LLM-based agents exhibit exceptional capabilities in addressing complex problems, there is a growing focus on developing coding agents to tackle increasingly sophisticated tasks. Despite their promising performance, these coding agents often produce programs or modifications that contain runtime errors, which can cause code failures and are difficult for static analysis tools to detect. Enhancing the ability of coding agents to statically identify such errors could significantly improve their overall performance. In this work, we introduce Execution-free Runtime Error Detection for COding Agents (REDO), a method that integrates LLMs with static analysis tools to detect runtime errors for coding agents, without code execution. Additionally, we propose a benchmark task, SWE-Bench-Error-Detection (SWEDE), based on SWE-Bench (lite), to evaluate error detection in repository-level…
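The abstract does not spell out how the two components are combined, but one of the reviews below describes the LLM as being engaged only when PyRight deems the code safe. A minimal sketch of that sequential arrangement follows. Everything in it is an assumption for illustration: `toy_static_check` uses Python's built-in `compile` as a stand-in for a real static analyzer such as PyRight, and `toy_llm_check` is a trivial string heuristic standing in for an LLM call. Neither is the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# A checker returns a short issue description, or None if it finds nothing.
Checker = Callable[[str], Optional[str]]


@dataclass
class Verdict:
    unsafe: bool   # True if a runtime error is predicted
    source: str    # which stage produced the verdict: "static" or "llm"
    detail: str


def detect_runtime_error(code: str, static_check: Checker,
                         llm_check: Checker) -> Verdict:
    """Run the static stage first; consult the LLM stage only if it passes."""
    issue = static_check(code)
    if issue is not None:
        return Verdict(True, "static", issue)
    issue = llm_check(code)
    if issue is not None:
        return Verdict(True, "llm", issue)
    return Verdict(False, "llm", "no error predicted")


def toy_static_check(code: str) -> Optional[str]:
    """Hypothetical stand-in for a static analyzer: parses, never executes."""
    try:
        compile(code, "<patch>", "exec")
    except SyntaxError as exc:
        return f"SyntaxError: {exc.msg}"
    return None


def toy_llm_check(code: str) -> Optional[str]:
    """Hypothetical stand-in for an LLM judgment (a trivial heuristic)."""
    return "possible ZeroDivisionError" if "/ 0" in code else None


broken = "def f(:\n    pass\n"
risky = "def f(x):\n    return x / 0\n"
fine = "def f(x):\n    return x + 1\n"

for snippet in (broken, risky, fine):
    print(detect_runtime_error(snippet, toy_static_check, toy_llm_check))
```

Passing the two stages in as callables keeps the control flow separate from the tools, so a real analyzer or model client could be substituted without changing `detect_runtime_error`.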
Peer Reviews
Decision·Submitted to ICLR 2025
I think the paper and the overarching idea of using existing program analysis tools in conjunction with LLMs within agents/systems is quite timely. Several recent breakthroughs in tasks such as SWEBench have been made by designing and integrating several classical tools such as linters, type checkers, repository dependency graphs, etc. I also found the data recipe to create the SWEDE benchmark from existing agent failures to be quite interesting.
**Motivations unclear**. The paper starts off with SWEBench and software agents as the motivation. However, the task suddenly switches to simply predicting execution errors. The paper never circles back to how this prediction could actually help agents/systems that generate code. This makes the overall motivation of the paper quite weak in my opinion. While it is clear that setting up execution environments for real-world code is challenging and an execution-free analyzer helps, the paper does not
1. Important problem. Coding agents face the serious problem of generating buggy and unreadable code. Automatically generated code is difficult to debug manually. It is important to find automatic ways of analyzing LLM-generated code. The authors provide an alternative solution in this direction. The proposed hybrid approach is reasonable and intuitive. Undoubtedly, LLMs can help find bugs in programs to some extent and could be combined with other algorithm-based approaches such as static
1. The capability of detecting bugs relies purely on the backend LLMs and static analysis tools. I didn't see how the two approaches can be deeply integrated and made to collaborate. Apparently, the simple loose coupling of static analysis tools and LLMs cannot find all bugs in code. Actually, no method exists to find all bugs. It would be very important to investigate collaborative ways of leveraging the advantages of tools and LLMs to find as many bugs or errors as possible. The current integration appr
- Originality: REDO combines static analysis with LLMs, and although similar approaches are becoming more common, this implementation remains useful.
- Quality: The empirical evaluation of REDO provides meaningful insights into the effectiveness of the proposed approach.
- Clarity: Overall, the paper is not difficult to follow.
- Significance: REDO addresses an important issue by ensuring program code security with LLMs.
- Overclaim: This paper claims to focus on error detection for coding agents; however, the REDO approach appears to be a more broadly applicable method for general program code error detection, which creates a misalignment in the work's focus.
- Limited Novelty: REDO combines PyRight with LLMs in a relatively straightforward way, with the LLM engaged only when PyRight deems the code safe. There seems to be no interaction or integration between the static analysis tool and the LLM beyond this se
Taxonomy
Topics: Distributed and Parallel Computing Systems · Distributed systems and fault tolerance · Real-Time Systems Scheduling
