TL;DR
This paper introduces Code Researcher, a deep research agent that uses large language models to generate patches for large systems codebases, significantly improving crash resolution through multi-step reasoning and context retrieval.
Contribution
It presents the first deep research agent for systems code, demonstrating improved crash resolution on Linux kernel benchmarks and showing robustness across models and codebases.
Findings
Code Researcher achieves a 48% crash-resolution rate on Linux kernel crashes.
Scaling sampling to 10 trajectories increases the crash-resolution rate (CRR) to 54%.
The approach remains robust with newer models such as Gemini 2.5-Flash.
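The effect of sampling more trajectories on the crash-resolution rate can be illustrated with a minimal sketch. It assumes a crash counts as resolved when at least one of its first k sampled trajectories produces a patch that prevents the crash; this page does not spell out the paper's exact definition, so treat that rule as an assumption. The function name, crash names, and outcomes below are hypothetical toy data, not results from the paper.

```python
from typing import Dict, List


def crash_resolution_rate(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of crashes resolved by at least one of the first k trajectories.

    `results` maps a crash id to per-trajectory outcomes (True = the patch
    from that trajectory prevented the crash). This "any of k" rule is an
    assumption made for illustration.
    """
    if not results:
        raise ValueError("no crashes to score")
    resolved = sum(1 for outcomes in results.values() if any(outcomes[:k]))
    return resolved / len(results)


# Hypothetical outcomes for four crashes, three trajectories each.
toy = {
    "crash_a": [True, False, False],
    "crash_b": [False, False, True],
    "crash_c": [False, False, False],
    "crash_d": [False, True, False],
}

print(crash_resolution_rate(toy, k=1))  # 0.25
print(crash_resolution_rate(toy, k=3))  # 0.75
```

Under this rule the rate can only stay the same or grow as k increases, which is consistent with the reported gain from sampling 10 trajectories.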
Abstract
Large Language Model (LLM)-based coding agents have shown promising results on coding benchmarks, but their effectiveness on systems code remains underexplored. Due to the size and complexity of systems code, making changes to a systems codebase requires researching many pieces of context, derived from the large codebase and its massive commit history, before making changes. Inspired by the recent progress on deep research agents, we design the first deep research agent for code, called Code Researcher, and apply it to the problem of generating patches to mitigate crashes reported in systems code. Code Researcher performs multi-step reasoning about semantics, patterns, and commit history of code to retrieve all relevant context from the codebase and its commit history. We evaluate Code Researcher on kBenchSyz, a benchmark of Linux kernel crashes, and show that it significantly…
Peer Reviews
Decision: Submitted to ICLR 2026
1. This paper focuses on fixing system code bugs, which is related to the popular SWE-bench-type problems but lower level and potentially larger in scale. Given the importance of system code, developing more effective systems for it is meaningful. Empirical evidence suggests Code Researcher outperforms existing agents for general software, such as SWE-agent and Agentless.
2. This paper contains many insightful analyses on the Code Researcher system, such as the accuracy of the Analysis p
1. The novelty of Code Researcher is limited. It follows a three-phase workflow similar to Agentless. Its Analysis phase is iterative. Likewise, SWE-agent and OpenHands can also perform iterative context retrieval. What makes Code Researcher different seems to be: 1) additional tools (search_commits); 2) specialized search strategy guidance through prompting (section 3.1.2). So the overall innovations are incremental.
2. The Pass@k metric seems incomplete as described in the paper. It is def
1. **Clear Problem Identification and Motivation**: It clearly identifies the shortcomings of existing code agents in handling large-scale system code (lack of deep context collection capabilities) and designs targeted solutions.
2. **Systematic Method Design**:
   - The three-stage process (analysis-synthesis-verification) is logically clear.
   - The reasoning strategies are reasonably designed (control/data flow tracing, pattern detection, commit history causal analysis).
   - The structured
### Methodological Limitations
1. **Limited Innovation**: The core method essentially applies known techniques from deep research agents (multi-step tool calls, reasoning strategies, memory mechanisms) to the code domain, lacking fundamental innovations tailored to code characteristics.
2. **Generality of Reasoning Strategies**:
   - The three reasoning strategies (control/data flow, pattern detection, commit history) are reasonable but are direct applications of traditional software engineerin
- To the best of my knowledge, this appears to be the first work explicitly exploring deep research within software engineering tasks, which makes it novel in scope.
- The paper tackles realistic and challenging problems, specifically software development and maintenance tasks in complex Linux kernel C/C++ codebases. Compared to prior work focusing on simpler scripting languages such as Python (e.g., SWE-bench), the chosen domain is harder, more practical, and less explored.
- I find the claim of performing “Deep Research for Code” somewhat confusing. From my understanding, project-level issue resolution inherently requires agentic search and information summarization. Thus, the conceptual gap between Code Researcher and a standard coding agent seems much smaller than the gap between a general AI chatbot and a deep research agent. Could the authors clarify what fundamental difference distinguishes a Code Researcher from a regular coding agent? Specifically, what uni
Taxonomy
Topics: Scientific Computing and Data Management · Software Engineering Research · Distributed and Parallel Computing Systems
