Putting It All into Context: Simplifying Agents with LCLMs

Mingjian Jiang; Yangjun Ruan; Luis Lastras; Pavan Kapanipathi; Tatsunori Hashimoto

arXiv:2505.08120·cs.CL·May 14, 2025

TL;DR

This paper shows that for complex tasks like SWE-bench, a long context language model (LCLM) with proper prompting can match or surpass the performance of more complex, scaffolded LM agent architectures, simplifying the approach.

Contribution

It demonstrates that removing scaffolds and tools from LM agents, using LCLMs with effective prompting, can achieve competitive results on challenging tasks.

Findings

1. Gemini-1.5-Pro achieves 38% on SWE-Bench-Verified without scaffolds.
2. Gemini-2.5-Pro attains a 50.8% solve rate unscaffolded.
3. A two-stage approach combining Gemini-1.5-Pro and Claude-3.7 reaches 48.6%.

Abstract

Recent advances in language model (LM) agents have demonstrated significant potential for automating complex real-world tasks. To make progress on these difficult tasks, LM agent architectures have become increasingly complex, often incorporating multi-step retrieval tools, multiple agents, and scaffolding adapted to the underlying LM. In this work, we investigate whether all of this complexity is necessary, or if parts of these scaffolds can be removed on challenging tasks like SWE-bench. We show that in the case of SWE-bench, simply putting the entire environment into the context of a long context language model (LCLM) and properly prompting the model makes it competitive with carefully tuned, complex agent scaffolds. We show that a Gemini-1.5-Pro model without any scaffolding or tools achieves 38% on SWE-Bench-Verified, comparable with approaches using carefully tuned agent scaffolds…
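The "put the entire environment into the context" idea from the abstract can be sketched as a simple prompt-packing step. This is a minimal illustration, not the authors' implementation: the function name `build_state_in_context_prompt`, the `[start of ...]` / `[end of ...]` delimiters, the prompt wording, and the character budget `max_chars` (a rough stand-in for a token limit) are all assumptions made for this example.

```python
from typing import Dict


def build_state_in_context_prompt(
    files: Dict[str, str],
    issue: str,
    max_chars: int = 2_000_000,  # hypothetical budget; a real system would count tokens
) -> str:
    """Concatenate repository files and an issue into one prompt string.

    Files are added in sorted path order until the character budget
    would be exceeded; remaining files are dropped.
    """
    parts = []
    used = 0
    for path in sorted(files):
        # Delimit each file so the model can tell where it starts and ends.
        block = f"[start of {path}]\n{files[path]}\n[end of {path}]\n"
        if used + len(block) > max_chars:
            break  # stop once the (assumed) budget is reached
        parts.append(block)
        used += len(block)

    repo_text = "".join(parts)
    return (
        "You are given a code repository and an issue.\n\n"
        f"<repository>\n{repo_text}</repository>\n\n"
        f"<issue>\n{issue}\n</issue>\n\n"
        "Write a patch that resolves the issue."
    )


if __name__ == "__main__":
    demo = {"b.py": "y = 2", "a.py": "x = 1"}
    print(build_state_in_context_prompt(demo, "Fix x"))
```

The resulting string would be sent to a long-context model in a single call, with no tools or multi-step retrieval. Dropping files once the budget is hit is the crudest possible compression; any smarter ranking of files would be a design choice beyond what this sketch shows.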

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper's core research question is simple, timely, and contrarian. I think this state-in-context design is a useful case study for the agent design space.
- The problem is well motivated. The paper spends considerable time and analysis showing that many coding problems have context that can reasonably be reduced to below 2M tokens. This helps build the case for the proposed state-in-context design.
- The paper's findings are meaningful. It provides a powerful, simple baseline that, with a s

Weaknesses

Long-context performance is a bit contradictory: Figure 4 and Table 5 show that LCLM performance decreases as context length grows and is highly sensitive to the position of the target file ("lost in the middle"). This strongly suggests that current LCLMs are not effective at "in-context retrieval" over very long, noisy inputs. This finding, which is central to the paper's thesis, should be in the main body. It weakens the "monolithic" DIRECTSOLVE argument and implies that a smarter select

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- Clear and well-motivated research question: The authors identify a meaningful gap—whether agentic complexity is necessary in fully observable environments.

Weaknesses

- Critical methodological contradiction between core claims and actual implementation. The paper emphasizes that it explores "simply putting the entire environment into the context of a long context language model (LCLM) and properly prompting the model" (Abstract, line 17), while criticizing existing work for requiring "a careful design of agentic scaffoldings tailored to specific tasks" (line 192). However, the actual implementation still follows the agentic workflow approach:
  - In "3.2 St

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. Highly original: Proposes "state-in-context" agents, revolutionizing traditional multi-round interactive agent design.
2. Simple method: No complex toolchain required, relying solely on prompts and context to complete complex tasks.
3. Excellent performance: Gemini-2.5-Pro's DIRECTSOLVE method achieved a 50.8% success rate, surpassing most scaffolding systems.
4. Scalability: Supports context compression and multi-model collaboration (e.g., SELECTSOLVE), adapting to diverse models and tasks.

Weaknesses

1. According to public information, the top reported score on SWE-Bench-Verified is already very high (78.80%). You should include more experimental results in the experimental section so that readers can confirm the feasibility of your method.
2. There are other mainstream benchmarks for code and for complex tasks. I hope the authors will provide more experimental results and conclusions on them, which would make the method more convincing.
3. Your references contain many methods from the past, but few new
