FeatureBench: Benchmarking Agentic Coding for Complex Feature Development
Qixing Zhou, Jiacheng Zhang, Haiyang Wang, Rui Hao, Jiahe Wang, Minghao Han, Yuxue Yang, Shuzhe Wu, Feiyang Pan, Lue Fan, Dandan Tu, Zhaoxiang Zhang

TL;DR
FeatureBench is a scalable, execution-based benchmark for evaluating large language model agents in complex, feature-oriented software development tasks, revealing current limitations and guiding future improvements.
Contribution
It introduces an automated, scalable framework for creating end-to-end, feature-level coding tasks from real repositories, expanding evaluation scope beyond simple bug fixing.
Findings
State-of-the-art models resolve only 11% of tasks
FeatureBench includes 200 challenging tasks and 3825 environments
Automated task collection enables scalable, up-to-date benchmarking
Abstract
Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope, e.g., bug fixing within a single pull request (PR), and often rely on non-executable evaluations or lack an automated approach for continually updating the evaluation coverage. To address such issues, we propose FeatureBench, a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. FeatureBench incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. By tracing from unit tests along a…
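As a rough illustration of what an execution-based protocol measures, the sketch below aggregates per-task unit-test outcomes into two scores. It is an assumption-laden example, not the paper's implementation: the `TaskResult` record, the function names, and the exact definitions (a task counts as resolved only if every one of its tests passes; the pass rate is the mean fraction of tests passed per task) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Hypothetical record of one task's unit-test run (not from the paper)."""
    task_id: str
    tests_passed: int
    tests_total: int


def resolved_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks whose tests all pass (assumed definition)."""
    if not results:
        return 0.0
    resolved = sum(
        1 for r in results
        if r.tests_total > 0 and r.tests_passed == r.tests_total
    )
    return resolved / len(results)


def passed_rate(results: List[TaskResult]) -> float:
    """Mean per-task fraction of tests passed (assumed definition)."""
    fractions = [
        r.tests_passed / r.tests_total for r in results if r.tests_total > 0
    ]
    return sum(fractions) / len(fractions) if fractions else 0.0


if __name__ == "__main__":
    demo = [TaskResult("task-a", 4, 4), TaskResult("task-b", 1, 4)]
    print(resolved_rate(demo), passed_rate(demo))
```

Under these assumed definitions, an agent can pass a sizable share of tests while resolving few tasks, since a single failing test leaves the task unresolved.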
Peer Reviews
Decision: ICLR 2026 Poster
- The benchmark targets feature-level development (not just bug fixes) and pairs each task in two modes, L1 (extend an existing repo) and L2 (from scratch), a clean formulation that isolates the role of context and raises the ceiling on task complexity.
- The evaluation is execution-based, with explicit interfaces and anti-cheating controls; the pipeline includes post-verification and ablations (e.g., hiding interfaces, step budgets, visible tests), plus clear metrics (Resolved/Passed/Token I/O).
- Positioning vs. closely related work needs to be sharper. The paper should more directly compare and differentiate from SWE-Dev (feature-driven development on large existing codebases with runnable environments; 14k train / 500 test and developer-authored unit tests) and commit0 (from-scratch library generation with API spec + interactive tests).
- Dataset composition skew. Although spanning 16 repos, the task mass is concentrated (e.g., Transformers dominates), which risks domain bias and ma
* The authors don't base their dataset on already existing ones but scrape their own data, which lessens the risk of data leakage.
* The paper is well written and easy to follow. The visualizations illustrate the core aspects of the work well.
* Assessing feature development capabilities is an important area that is under-explored.
* The dataset seems to be significantly more complex in terms of gold solution lines, files, functions, and number of tests.
* The graph-based function extraction is nove
* The authors do not provide much analysis to show that their tasks are truly solvable. Given that the problem statements are LLM-generated, this needs to be shown. The authors propose that AssertionErrors indicate that problem statements contain sufficient information. However, runnable code does not correlate with solvability of the tasks.
* The dataset is Python-only, which severely limits the degree to which one can measure coding agent performance.
* Only a single agent (OpenHands) is evaluated
- Evaluating feature-level implementation is both novel and important. As evidenced by recent SWE-bench leaderboard results, modern coding agents can perform bug-fixing tasks with high accuracy. However, their capability to handle feature-level implementations remains largely unexplored. This paper addresses this gap by providing a benchmark specifically designed to evaluate this capability.
- The benchmark is designed with usability in mind. Given that evaluating the full set requires approxim
- Allowing unrestricted library usage may enable agents to complete tasks by simply calling existing library functions, essentially testing library knowledge rather than implementation capability (the benchmark allows agents to use pip install to add arbitrary libraries; see Figure 13). While the authors prevent access to ground-truth implementations via anti-cheating mechanisms, the policy on legitimate library usage remains unclear. The authors should clarify whether the evaluation assesses (a) th
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Advanced Software Engineering Methodologies · Software Engineering Research
