SWE Context Bench: A Benchmark for Context Learning in Coding
Jiayuan Zhu, Junde Wu, Minhao Hu, Shengda Zhu, Jiazhen Pan, Weixiang Shen, Yijun Yang, Fenglin Liu, Jianye Hao, Yueming Jin, Qirong Ho, Min Xu

TL;DR
SWE-ContextBench is a new benchmark for evaluating how well coding agents understand and reuse context from related software engineering tasks, emphasizing retrieval accuracy and efficiency.
Contribution
It introduces a benchmark with real-world related tasks and analyzes the impact of context retrieval strategies on coding agent performance.
Findings
Accurate context retrieval improves task resolution accuracy.
Proper context management reduces runtime and token costs.
Incorrect context selection can negatively impact performance.
Abstract
Large language models are increasingly used as coding agents for software engineering tasks. Current benchmarks mainly evaluate whether the agent can correctly solve the request or fix the bugs. They largely treat tasks as independent and do not assess whether agents can reuse previous experience across related problems. As a result, the efficiency gains from reusing the previous experience remains difficult to measure. We introduce SWE-ContextBench, a benchmark designed to explicitly evaluate context understanding and retrieval in coding agents. SWE-ContextBench consists of 1,100 base tasks with another 376 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests. SWE-ContextBench groups base tasks and related tasks with shared context across 51 unique repositories and 9 programming languages. The benchmark evaluates how accurately…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
