TL;DR
This paper introduces Enconda-bench, a novel benchmark for detailed process-level evaluation of environment configuration capabilities in software engineering agents, revealing their strengths and limitations beyond simple success metrics.
Contribution
It presents the first framework for process-level internal capability assessment in environment configuration, enabling detailed analysis of agent performance and error diagnosis.
Findings
Agents can localize errors effectively.
Agents struggle to convert feedback into successful corrections.
Enconda-bench enables scalable, high-quality evaluation of agent capabilities.
Abstract
Large language model-based agents show promise for software engineering, but environment configuration remains a bottleneck due to heavy manual effort and scarce large-scale, high-quality datasets. Existing benchmarks assess only end-to-end build/test success, obscuring where and why agents succeed or fail. We introduce the Environment Configuration Diagnosis Benchmark, Enconda-bench, which provides process-level trajectory assessment of fine-grained agent capabilities during environment setup: planning, perception-driven error diagnosis, feedback-driven repair, and action to execute the final environment configuration. Our task instances are automatically constructed by injecting realistic README errors and are validated in Docker for scalable, high-quality evaluation. Enconda-bench combines process-level analysis with end-to-end executability to enable capability assessments beyond…
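To make the contrast between process-level and end-to-end evaluation concrete, the following is a minimal sketch and not the benchmark's actual scoring code. The `Trajectory` record, its field names, and the three rate names are hypothetical, chosen only for illustration. It assumes each task instance carries a known set of injected README errors and that a run records which of them the agent diagnosed and which it repaired.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one agent run on one task instance."""
    injected_errors: set[str]   # ids of errors injected into the README
    diagnosed_errors: set[str]  # ids the agent flagged as errors
    repaired_errors: set[str]   # ids the agent's edits actually fixed
    build_succeeded: bool       # end-to-end executability of the final environment


def stage_rates(trajectories: list[Trajectory]) -> dict[str, float]:
    """Aggregate per-stage rates (illustrative metric names, not the paper's)."""
    total = sum(len(t.injected_errors) for t in trajectories)
    # An error counts as diagnosed only if it was really injected.
    diagnosed = sum(
        len(t.injected_errors & t.diagnosed_errors) for t in trajectories
    )
    # An error counts as repaired only if it was also diagnosed.
    repaired = sum(
        len(t.injected_errors & t.diagnosed_errors & t.repaired_errors)
        for t in trajectories
    )
    builds = sum(1 for t in trajectories if t.build_succeeded)
    return {
        "diagnosis_rate": diagnosed / total if total else 0.0,
        "repair_given_diagnosis": repaired / diagnosed if diagnosed else 0.0,
        "end_to_end_success": builds / len(trajectories) if trajectories else 0.0,
    }


# Two made-up runs, each with two injected errors.
runs = [
    Trajectory({"a", "b"}, {"a", "b"}, {"a"}, build_succeeded=False),
    Trajectory({"c", "d"}, {"c"}, {"c"}, build_succeeded=True),
]
rates = stage_rates(runs)
print(rates)
```

In this toy data the agent diagnoses three of four injected errors but repairs only two of those three, a gap that a single end-to-end success figure would not expose.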
Peer Reviews
Decision·ICLR 2026 Poster
- tackles a very important and currently understudied problem of using LLM agents to build software development environments
- the benchmark is constructed nicely with detailed descriptions of the pipeline and manual examination, which can be used by future work
- the authors also evaluate the benchmark already on several important baselines together with detailed analysis
Missing key evaluation category:
- As the authors demonstrate in the paper, and as prior work shows, developing a script that can successfully build an environment is non-trivial.
- In the paper the authors focus on the task of repairing a README with errors.
- However, the benchmark could also easily be used without README errors, to evaluate how well LLM agents generate a correct environment when given a correct README.
- I think this is an interesting scenario and can allow th
* This paper identifies an important issue in agentic coding and proposes a targeted dataset and benchmark to enable future research.
* While there are no technical contributions beyond the dataset and some of the analysis, the evaluation suite does seem to offer some key benefits over previous work, in particular when it comes to the “process-level” evaluation.
* The dataset creation procedure is thorough and well-explained; I think this will be a useful resource for the community.
* There are some notable omissions in the evaluations, such as GPT-5-Codex and Claude 4.5, both of which are considered SOTA base models for coding. Furthermore, for the coding agents, why not include Codex CLI, Gemini CLI, Jules, and Claude Code? These are specifically optimized to handle novel codebases and deal with configuration issues.
* I’m not sure that ICLR is the best venue for this work. Perhaps a dedicated dataset/benchmark track would be better suited.
* Some of the figures are too sm
1. This work addresses a critical yet unexplored bottleneck, moving from end-to-end pass/fail metrics to process-level trajectory assessment. This helps extract actionable feedback for agent designers.
2. The decomposition of evaluation into perception, feedback, and action provides fine-grained diagnostics beyond aggregate success metrics.
1. The six error types chosen are said to be "guided by failure modes frequently encountered in practice", but no citation or prior empirical study is provided to substantiate this taxonomy. Without grounding in developer-observed data, it is unclear whether these six types cover real-world failure modes, or simply reflect intuitive assumptions.
2. Each erroneous README is created by injecting two errors per file. This raises two concerns: (i) since both synthesis and evaluation rely on LLM beh
