Characterizing the Failure Modes of LLMs in Resolving Real-World GitHub Issues
Yanjie Jiang, Yian Huang, Guancheng Wang, Junjie Chen, Hui Liu, Lionel Briand

TL;DR
This paper analyzes how large language models fail when resolving real-world GitHub issues, identifying the stages at which failures occur and their root causes, and suggesting improvements for reliability.
Contribution
It introduces a unified failure taxonomy spanning five stages of the issue-repair pipeline, identifies the most error-prone stages and their root causes, and evaluates three models (Claude 4.5 Sonnet, Gemini 3 Pro, and GPT-5) on the SWE-bench Verified dataset.
Findings
Strategy formulation and logic synthesis is the most error-prone stage for all evaluated LLMs.
LLMs excel at fault localization compared to other stages.
Robustness and operational costs vary significantly across models.
Abstract
Large Language Models (LLMs) are increasingly deployed to resolve real-world GitHub issues. However, despite their potential, the specific failure modes of these models in complex repair tasks remain poorly understood. To characterize how LLM behavior diverges from human developer practices, this paper evaluates three state-of-the-art models, i.e., Claude 4.5 Sonnet, Gemini 3 Pro, and GPT-5, on the SWE-bench Verified dataset. We conduct a rigorous manual analysis of the symptoms and root causes underlying 243 failed attempts across 900 total trials. Our investigation first yields a unified failure taxonomy encompassing five distinct stages of the repair pipeline, within which we categorize typical failure symptoms and their prevalence. Secondly, our findings reveal that for all evaluated LLMs, strategy formulation and logic synthesis constitutes the most error-prone stage, followed by…
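As a quick sanity check on the scale of the study, the aggregate failure rate implied by the abstract's figures can be computed directly. This is a minimal sketch that assumes the 243 analyzed attempts are all of the failures among the 900 trials; the abstract does not give a per-model breakdown, so none is attempted here, and the variable names are illustrative.

```python
# Figures quoted in the abstract.
failed_attempts = 243
total_trials = 900

# Assumption: the 243 analyzed attempts are every failure among the 900 trials.
failure_rate = failed_attempts / total_trials

print(f"Aggregate failure rate: {failure_rate:.1%}")
```

Under that assumption, roughly 27% of trials failed across the three models combined.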
