Beyond Resolution Rates: Behavioral Drivers of Coding Agent Success and Failure
Tural Mehtiyev, Wesley Assunção

TL;DR
This study analyzes the behavioral factors influencing the success and failure of coding agents, revealing that architectural reasoning and context gathering are key, with LLM capability being the primary success driver.
Contribution
It provides a large-scale empirical analysis of coding agent failures, highlighting the importance of behavioral patterns and LLM capabilities over framework design.
Findings
Agents fail on simple tasks due to reasoning and knowledge gaps.
Context-gathering and validation strategies correlate with success.
LLM capability is the main factor influencing agent performance and behavior.
Abstract
Coding agents represent a new paradigm in automated software engineering, combining the reasoning capabilities of Large Language Models (LLMs) with tool-augmented interaction loops. However, they remain severely limited: even top-ranked LLM-based coding agents fail on over 20% of benchmarked problems. Yet we lack a systematic understanding of why agents fail (i.e., the causes) and how failure unfolds behaviorally. We present a large-scale empirical study analyzing 9,374 trajectories from 19 agents (8 coding agent frameworks, 14 LLMs) on 500 tasks. We organize our analysis around three research questions. First, we investigate why agents fail on specific tasks and find that patch complexity alone does not explain difficulty: 12 never-solved tasks require only simple patches and were considered easy by human annotators, yet all agents fail due to gaps in architectural…
