The Limits of Long-Context Reasoning in Automated Bug Fixing
Ravi Raju, Mengmeng Ji, Shubhangi Upasani, Bo Li, Urmish Thakker

TL;DR
This paper systematically evaluates the long-context reasoning capabilities of current large language models in automated bug fixing, revealing significant limitations despite recent performance gains on software engineering benchmarks.
Contribution
It provides a comprehensive analysis showing that current LLMs struggle with genuine long-context reasoning in bug fixing, emphasizing the gap between nominal and effective context lengths.
Findings
Performance drops sharply at 64k-token contexts
Agentic success comes mainly from short-context task decomposition
Systematic failure modes include hallucinated diffs and incorrect targets
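
To make the second finding concrete, here is a minimal Python sketch of how one might split agentic trajectories by accumulated token count and compare resolve rates on each side of a cutoff. It is not the authors' analysis code. The `Trajectory` type, the rough 4-characters-per-token estimate, and the 30,000-token default cutoff (picked from the 20k-30k range mentioned in the abstract) are all illustrative assumptions; a real analysis would use the model's own tokenizer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Trajectory:
    """Hypothetical record of one agent run: its messages and the outcome."""
    messages: List[str]
    resolved: bool


def approx_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. This is a crude stand-in
    # for a real tokenizer and will not match any model's exact counts.
    return max(1, len(text) // 4) if text else 0


def trajectory_tokens(traj: Trajectory) -> int:
    # Accumulated context size: sum over every message in the run.
    return sum(approx_tokens(m) for m in traj.messages)


def resolve_rate_by_bucket(
    trajs: List[Trajectory], threshold: int = 30_000
) -> Dict[str, Optional[float]]:
    # Split runs into "short" (<= threshold tokens) and "long" (> threshold),
    # then report the fraction resolved in each bucket (None if empty).
    counts = {"short": [0, 0], "long": [0, 0]}  # [resolved, total]
    for t in trajs:
        key = "short" if trajectory_tokens(t) <= threshold else "long"
        counts[key][0] += int(t.resolved)
        counts[key][1] += 1
    return {k: (r / n if n else None) for k, (r, n) in counts.items()}
```

Called on a list of recorded runs, `resolve_rate_by_bucket(runs)` returns a dictionary with a `"short"` and a `"long"` resolve rate. A large gap between the two would be consistent with the finding that successful runs stay within short contexts.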
Abstract
Rapidly increasing context lengths have led to the assumption that large language models (LLMs) can directly reason over entire codebases. Concurrently, recent advances in LLMs have enabled strong performance on software engineering benchmarks, particularly when paired with agentic workflows. In this work, we systematically evaluate whether current LLMs can reliably perform long-context code debugging and patch generation. Using SWE-bench Verified as a controlled experimental setting, we first evaluate state-of-the-art models within an agentic harness (mini-SWE-agent), where performance improves substantially: GPT-5-nano achieves up to a 31% resolve rate on 100 samples, and open-source models such as DeepSeek-R1-0528 obtain competitive results. However, token-level analysis shows that successful agentic trajectories typically remain under 20k-30k tokens, and that longer accumulated…
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software System Performance and Reliability
