Multi-Docker-Eval: A 'Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering

Kelin Fu; Tianyu Liu; Zeyu Shang; Yingwei Ma; Jian Yang; Jiaheng Liu; Kaigui Bian

arXiv:2512.06915 · cs.SE · December 15, 2025

TL;DR

This paper introduces Multi-Docker-Eval, a benchmark for evaluating automated environment building in software engineering, revealing current limitations of large language models and agent frameworks in achieving reliable, scalable automation.

Contribution

The paper presents a new benchmark for environment configuration in SWE, providing a standardized evaluation platform and insights into the performance of current models and frameworks.

Findings

01

Current models achieve a low success rate in environment configuration (F2P at most 37.7%).

02

Model size and reasoning length are not the main success factors.

03

Open-source models like DeepSeek-V3.1 and Kimi-K2 perform competitively.

Abstract

Automated environment configuration is a critical bottleneck in scaling software engineering (SWE) automation. To provide a reliable evaluation standard for this task, we present the Multi-Docker-Eval benchmark. It includes 40 real-world repositories spanning 9 programming languages and measures both success in achieving executable states and efficiency under realistic constraints. Our extensive evaluation of state-of-the-art LLMs and agent frameworks reveals key insights: (1) the overall success rate of current models is low (F2P at most 37.7%), with environment construction being the primary bottleneck; (2) model size and reasoning length are not decisive factors, and open-source models like DeepSeek-V3.1 and Kimi-K2 are competitive in both efficiency and effectiveness; (3) agent framework and programming language also have a significant influence on success rate. These findings provide…

Taxonomy

Topics: Software Engineering Research · Software Engineering Techniques and Practices · Software System Performance and Reliability