TL;DR
This paper shows that on complex tasks like SWE-bench, a long-context language model (LCLM) with proper prompting can match or surpass more complex, scaffolded LM agent architectures, which simplifies the overall approach.
Contribution
It demonstrates that LCLMs with effective prompting, stripped of agent scaffolds and tools, can achieve competitive results on challenging tasks.
Findings
Gemini-1.5-Pro achieves 38% on SWE-Bench-Verified without scaffolds.
Gemini-2.5-Pro attains a 50.8% solve rate without scaffolding.
A two-stage approach combining Gemini-1.5-Pro and Claude-3.7 reaches 48.6%.
Abstract
Recent advances in language model (LM) agents have demonstrated significant potential for automating complex real-world tasks. To make progress on these difficult tasks, LM agent architectures have become increasingly complex, often incorporating multi-step retrieval tools, multiple agents, and scaffolding adapted to the underlying LM. In this work, we investigate whether all of this complexity is necessary, or if parts of these scaffolds can be removed on challenging tasks like SWE-bench. We show that in the case of SWE-bench, simply putting the entire environment into the context of a long context language model (LCLM) and properly prompting the model makes it competitive with carefully tuned, complex agent scaffolds. We show that a Gemini-1.5-Pro model without any scaffolding or tools achieves 38% on SWE-Bench-Verified, comparable with approaches using carefully tuned agent scaffolds…
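The idea of placing the whole environment in the model's context can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `estimate_tokens` and `pack_repository`, the prompt wording, the 4-characters-per-token estimate, and the skip-on-overflow policy are all assumptions made for illustration. The sketch only shows how an issue and repository files might be concatenated into one prompt under a token budget.

```python
from typing import Dict, List, Tuple


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token; a real tokenizer would differ.
    return (len(text) + 3) // 4


def pack_repository(
    files: Dict[str, str], issue: str, token_budget: int
) -> Tuple[str, List[str]]:
    """Build one prompt holding the issue and as many files as fit the budget.

    Hypothetical helper: the prompt layout here is illustrative only.
    """
    header = f"Issue:\n{issue}\n\nRepository files:\n"
    footer = "\nWrite a patch that resolves the issue."
    used = estimate_tokens(header) + estimate_tokens(footer)

    parts = [header]
    included: List[str] = []
    for path in sorted(files):
        block = f"--- {path} ---\n{files[path]}\n"
        cost = estimate_tokens(block)
        if used + cost > token_budget:
            continue  # Skip any file that would overflow the budget.
        parts.append(block)
        included.append(path)
        used += cost

    parts.append(footer)
    return "".join(parts), included
```

With a generous budget every file is included; with a tight one, oversized files are dropped, which is where a smarter selection or compression step would come in.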
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper's core research question is simple, timely, and contrarian. I think this state-in-context design is a useful study case for the agent design space.
- The problem is well motivated. The paper spends considerable time and analysis showing that many coding problems have context that can be reasonably reduced to below 2M tokens. This helps build the case for the proposed state-in-context design.
- The paper's findings are meaningful. It provides a powerful, simple baseline that, with a s
Long context performance is a bit contradictory: Figure 4 and Table 5 show that LCLM performance decreases as context length grows and is highly sensitive to the position of the target file ("lost in the middle"). This strongly suggests that current LCLMs are not effective at "in-context retrieval" over very long, noisy inputs. This finding, which is central to the paper's thesis, should be in the main body. It weakens the "monolithic" DIRECTSOLVE argument and implies that a smarter select
- Clear and well-motivated research question: The authors identify a meaningful gap—whether agentic complexity is necessary in fully observable environments.
- Critical methodological contradiction between core claims and actual implementation. The paper emphasizes that it explores "simply putting the entire environment into the context of a long context language model (LCLM) and properly prompting the model" (Abstract, line 17), while criticizing existing work for requiring "a careful design of agentic scaffoldings tailored to specific tasks" (line 192). However, the actual implementation still follows the agentic workflow approach:
  - In "3.2 St
1. Highly Original: Proposed "state-in-context" agents, revolutionizing traditional multi-round interactive agent design.
2. Simple Method: No complex toolchain required, relying solely on prompts and context to complete complex tasks.
3. Excellent Performance: Gemini-2.5-Pro's DIRECTSOLVE method achieved a 50.8% success rate, surpassing most scaffolding systems.
4. Scalability: Supports context compression and multi-model collaboration (e.g., SELECTSOLVE), adapting to diverse models and tasks.
1. According to public information, the best reported SWE-BENCH-VERIFIED score is already very high (78.80%). You should include more experimental results in the experimental section so that readers can confirm the feasibility of your method.
2. There are other mainstream benchmarks for code and for complex tasks. I hope the authors can provide more experimental results and conclusions on them, which would make the method more convincing.
3. Your references contain many methods from the past, but few new
