REDO: Execution-Free Runtime Error Detection for COding Agents

Shou Li; Andrey Kan; Laurent Callot; Bhavana Bhasker; Muhammad Shihab Rashid; Timothy B Esler

arXiv:2410.09117·cs.SE·October 15, 2024

Open Access · 3 Reviews

TL;DR

REDO combines LLMs with static analysis tools to detect runtime errors in code without executing it, improving error detection accuracy for coding agents over prior methods.

Contribution

This work presents REDO, a new method integrating LLMs with static analysis for runtime error detection, and introduces SWEDE, a benchmark for evaluating such errors in complex code repositories.

Findings

01. REDO achieves 11.0% higher accuracy than state-of-the-art methods.

02. REDO attains a 9.1% higher weighted F1 score.

03. The approach provides better error detection insights for coding agents.
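
The accuracy and weighted F1 figures above are standard classification metrics. A minimal sketch of how they are computed, assuming the usual support-weighted definition of F1 (the same one scikit-learn uses for `average="weighted"`). The label names and toy predictions are hypothetical and are not taken from the paper:

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    support = Counter(y_true)  # number of true instances per label
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        score += f1 * count / total
    return score


# Hypothetical toy data: "error" = patch raises a runtime error, "safe" = it does not.
y_true = ["error", "error", "safe", "safe", "safe"]
y_pred = ["error", "safe", "safe", "safe", "error"]
```

Weighting by support matters when the classes are imbalanced: a detector that always answers "safe" can reach high accuracy while its weighted F1 stays low, because the "error" class contributes an F1 of zero.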

Abstract

As LLM-based agents exhibit exceptional capabilities in addressing complex problems, there is a growing focus on developing coding agents to tackle increasingly sophisticated tasks. Despite their promising performance, these coding agents often produce programs or modifications that contain runtime errors, which can cause code failures and are difficult for static analysis tools to detect. Enhancing the ability of coding agents to statically identify such errors could significantly improve their overall performance. In this work, we introduce Execution-free Runtime Error Detection for COding Agents (REDO), a method that integrates LLMs with static analysis tools to detect runtime errors for coding agents, without code execution. Additionally, we propose a benchmark task, SWE-Bench-Error-Detection (SWEDE), based on SWE-Bench (lite), to evaluate error detection in repository-level…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 4

Strengths

I think the paper and the overarching idea of using existing program analysis tools in conjunction with LLMs within agents/systems is quite timely. Several recent breakthroughs in tasks such as SWEBench have been made by designing and integrating several classical tools such as linters, type checkers, repository dependency graphs, etc. I also found the data recipe to create the SWEDE benchmark from existing agent failures to be quite interesting.

Weaknesses

**Motivations unclear**. The paper starts off with SWEBench and software agents as the motivation. However, the task suddenly switches to simply predicting execution errors. The paper never circles back to how this prediction could actually help agents or systems that generate code. This makes the overall motivation of the paper quite weak in my opinion. While it is clear that setting up execution environments for real-world code is challenging and an execution-free analyzer helps, the paper does not

Reviewer 02 · Rating 3 · Confidence 4

Strengths

1. Important problem. Coding agents face a major problem of generating buggy and unreadable code. Automatically generated code is difficult to debug manually. It is important to find automatic ways of analyzing LLM-generated code. The authors provide an alternative solution in this direction. The proposed hybrid approach is reasonable and intuitive. Undoubtedly, LLMs can help find bugs in programs to some extent and could be leveraged with other algorithm-based approaches such as static

Weaknesses

1. The capability of detecting bugs relies purely on the backend LLMs and static analysis tools. I didn't see how the two approaches can be deeply integrated or made to collaborate. Apparently, a simple loose coupling of static analysis tools and LLMs cannot find all bugs in code. Actually, no method exists to find all bugs. It would be very important to investigate collaborative ways of leveraging the advantages of tools and LLMs to find as many bugs or errors as possible. The current integration appr

Reviewer 03 · Rating 3 · Confidence 3

Strengths

- Originality: REDO combines static analysis with LLMs, and although similar approaches are becoming more common, this implementation remains useful.
- Quality: The empirical evaluation of REDO provides meaningful insights into the effectiveness of the proposed approach.
- Clarity: Overall, the paper is not difficult to follow.
- Significance: REDO addresses an important issue by ensuring program code security with LLMs.

Weaknesses

- Overclaim: This paper claims to focus on error detection for coding agents; however, the REDO approach appears to be a more broadly applicable method for general program code error detection, which creates a misalignment in the work's focus.
- Limited Novelty: REDO combines PyRight with LLMs in a relatively straightforward way, with the LLM engaged only when PyRight deems the code safe. There seems to be no interaction or integration between the static analysis tool and the LLM beyond this se
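
The sequential arrangement this review describes (static analysis first, with the LLM consulted only when the static tool reports nothing) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `Verdict`, `detect_runtime_error`, and both toy checkers are hypothetical names, and the checkers are passed in as plain functions so that no particular PyRight or LLM API is assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Verdict:
    unsafe: bool                 # True if a runtime error is predicted
    source: str                  # "static" or "llm": which stage decided
    details: List[str] = field(default_factory=list)


def detect_runtime_error(
    code: str,
    static_check: Callable[[str], List[str]],  # returns a list of findings
    llm_check: Callable[[str], bool],           # returns True if flagged unsafe
) -> Verdict:
    """Two-stage check: the LLM runs only if the static stage finds nothing."""
    findings = static_check(code)
    if findings:
        # The static tool already flagged the code; the LLM is never called.
        return Verdict(unsafe=True, source="static", details=findings)
    # The static tool deemed the code safe; fall back to the LLM's judgement.
    return Verdict(unsafe=llm_check(code), source="llm")


def toy_static_check(code: str) -> List[str]:
    """Stand-in for a static analyzer: reports only syntax errors."""
    try:
        compile(code, "<patch>", "exec")
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    return []


def toy_llm_check(code: str) -> bool:
    """Stand-in for an LLM call: a keyword heuristic, purely for illustration."""
    return "1 / 0" in code
```

Calling `detect_runtime_error("y = 1 / 0", toy_static_check, toy_llm_check)` passes the static stage and is flagged by the second stage. The design makes the reviewer's point concrete: the two stages share no information beyond the static tool's empty result.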

Taxonomy

Topics: Distributed and Parallel Computing Systems · Distributed systems and fault tolerance · Real-Time Systems Scheduling

MethodsFocus