The Limits of Long-Context Reasoning in Automated Bug Fixing

Ravi Raju; Mengmeng Ji; Shubhangi Upasani; Bo Li; Urmish Thakker

arXiv:2602.16069 · cs.SE · March 9, 2026

Open Access

TL;DR

This paper systematically evaluates the long-context reasoning capabilities of current large language models in automated bug fixing, revealing significant limitations despite recent performance gains in software engineering benchmarks.

Contribution

It provides a comprehensive analysis showing that current LLMs struggle with genuine long-context reasoning in bug fixing, emphasizing the gap between nominal and effective context lengths.

Findings

01. Performance drops sharply at a 64k-token context

02. Agentic success stems mainly from short-context task decomposition

03. Systematic failure modes include hallucinated diffs and incorrect targets

Abstract

Rapidly increasing context lengths have led to the assumption that large language models (LLMs) can directly reason over entire codebases. Concurrently, recent advances in LLMs have enabled strong performance on software engineering benchmarks, particularly when paired with agentic workflows. In this work, we systematically evaluate whether current LLMs can reliably perform long-context code debugging and patch generation. Using SWE-bench Verified as a controlled experimental setting, we first evaluate state-of-the-art models within an agentic harness (mini-SWE-agent), where performance improves substantially: GPT-5-nano achieves up to a 31% resolve rate on 100 samples, and open-source models such as Deepseek-R1-0528 obtain competitive results. However, token-level analysis shows that successful agentic trajectories typically remain under 20k-30k tokens, and that longer accumulated…
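The token-level analysis mentioned in the abstract can be pictured as grouping agent trajectories by accumulated token count and computing a resolve rate per group. The sketch below is only an illustration under stated assumptions: the `Trajectory` record, the `resolve_rate_by_bucket` helper, the bucket edges, and the toy values are all hypothetical and are not taken from the paper or from mini-SWE-agent.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record for one agent run (not the paper's data format)."""
    total_tokens: int  # tokens accumulated over the whole run
    resolved: bool     # True if the final patch resolved the issue


def resolve_rate_by_bucket(trajectories, edges):
    """Group runs into token-count buckets and compute a resolve rate for each.

    `edges` is an ascending list of bucket boundaries, e.g. [20_000, 30_000].
    Returns {label: (n_resolved, n_total, rate)}; rate is None for empty buckets.
    """
    bounds = [0, *edges, None]  # None marks the open-ended last bucket
    buckets = {}
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        label = f"{lo}-{hi}" if hi is not None else f"{lo}+"
        members = [
            t for t in trajectories
            if t.total_tokens >= lo and (hi is None or t.total_tokens < hi)
        ]
        n_resolved = sum(t.resolved for t in members)
        rate = n_resolved / len(members) if members else None
        buckets[label] = (n_resolved, len(members), rate)
    return buckets


# Made-up toy values, purely to exercise the function.
toy_runs = [
    Trajectory(12_000, True),
    Trajectory(18_000, True),
    Trajectory(25_000, False),
    Trajectory(70_000, False),
]

if __name__ == "__main__":
    for label, stats in resolve_rate_by_bucket(toy_runs, [20_000, 30_000]).items():
        print(label, stats)
```

Half-open buckets (lower bound inclusive, upper bound exclusive) keep every run in exactly one group, so the per-bucket totals always sum to the number of runs.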

Taxonomy

Topics: Software Engineering Research · Software Testing and Debugging Techniques · Software System Performance and Reliability