IDE-Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks

Spencer Mateega; Jeff Yang; Tiana Costello; Shaurya Jadhav; Nicole Tian; Agustin Garcinuño

arXiv:2601.20886·cs.SE·February 2, 2026

TL;DR

IDE-Bench is a comprehensive framework for evaluating AI-powered IDE agents on real-world software engineering tasks across multiple languages and stacks, using a structured, IDE-native interface.

Contribution

It introduces a Dockerized test harness with high-level abstractions for code search, editing, and testing, enabling systematic evaluation of AI IDE agents in realistic scenarios.

Findings

01. First benchmark to correlate agent intent with project success in multi-language environments.

02. Evaluates agents on tasks like feature development, bug fixing, and refactoring.

03. Provides a public leaderboard for ongoing benchmarking.

Abstract

IDE-Bench is a comprehensive framework for evaluating AI IDE agents on real-world software engineering tasks through an IDE-native tool interface. We present a Dockerized test harness that goes beyond raw terminal execution, granting models a structured tool ecosystem that represents AI-native IDEs like Cursor and Windsurf. By providing high-level abstractions for codebase search, structured file editing, and tools for testing full-stack applications, IDE-Bench evaluates an agent's ability to act as a true engineering collaborator. For evaluation and to prevent training data contamination, we created 80 tasks across eight never-published repositories spanning C/C++, Java, and MERN stacks, representing modern tech stack production scenarios, including feature implementation, bug fixing, refactoring, and performance optimization that mirror daily developer workflows in private codebases.…

Taxonomy

Topics: Scientific Computing and Data Management · Software Engineering Research · Software Testing and Debugging Techniques