LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

Jielin Qiu; Zuxin Liu; Zhiwei Liu; Rithesh Murthy; Jianguo Zhang; Haolin Chen; Shiyu Wang; Ming Zhu; Liangwei Yang; Juntao Tan; Roshan Ram; Akshara Prabhakar; Tulika Awalgaonkar; Zixiang Chen; Zhepeng Cen; Cheng Qian; Shelby Heinecke; Weiran Yao; Silvio Savarese; Caiming Xiong; Huan Wang

arXiv:2511.13998 · cs.SE · November 19, 2025

Open Access

TL;DR

LoCoBench-Agent is a new comprehensive benchmark framework designed to evaluate large language model agents in realistic, long-context software engineering tasks, focusing on multi-turn interactions, tool usage, and efficiency across extended sessions.

Contribution

The paper introduces an interactive evaluation framework with 9 metrics, 8 tools, and long-context assessment up to 1 million tokens. It fills a gap left by existing benchmarks such as LoCoBench, which focus on single-turn evaluation.

Findings

01 Agents show strong long-context robustness.

02 A negative correlation exists between comprehension and efficiency.

03 Conversation efficiency varies widely across models.

Abstract

As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real-world capabilities becomes critical. While existing benchmarks like LoCoBench (Qiu et al., 2025) assess long-context code understanding, they focus on single-turn evaluation and cannot capture the multi-turn interactive nature, tool usage patterns, and adaptive reasoning required by real-world coding agents. We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess LLM agents in realistic, long-context software engineering workflows. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across extended development sessions. We also…

Taxonomy

Topics: Software Engineering Research · Software Engineering Techniques and Practices · Multi-Agent Systems and Negotiation