ResearchEnvBench: Benchmarking Agents on Environment Synthesis for Research Code Execution
Yubang Wang, Chenxi Zhang, Bowen Chen, Zezheng Huai, Zihao Dai, Xinchi Chen, Yuxin Wang, Yining Zheng, Jingjing Gong, Xipeng Qiu

TL;DR
ResearchEnvBench is a new benchmark that evaluates autonomous agents on their ability to synthesize execution environments for research code, addressing a critical gap in reproducibility and dependency management.
Contribution
It introduces a benchmark for environment synthesis in research code execution, highlighting current limitations and guiding future improvements in autonomous research agents.
Findings
Current SOTA agents often fail due to incomplete dependency resolution.
Failures are mainly caused by brittle version coupling and environment misconfigurations.
ResearchEnvBench offers a realistic testbed for advancing reproducible scientific research.
Abstract
Autonomous agents are increasingly expected to support scientific research, and recent benchmarks report progress in code repair and autonomous experimentation. However, these evaluations typically assume a pre-configured execution environment. Building that environment requires resolving complex software dependencies, aligning hardware and framework versions, and configuring distributed execution, yet this capability remains largely unbenchmarked. We introduce ResearchEnvBench, a benchmark for environment synthesis in research code execution. Given a research repository, documentation, and a target execution setting, agents must construct an environment in which the repository's code executes successfully at runtime. Evaluations on diverse research repositories reveal a substantial gap in current SOTA agents, with failures dominated by incomplete dependency resolution and brittle version coupling. ResearchEnvBench provides a…
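To make the two dominant failure modes concrete, the following is a minimal sketch of how one might probe an already-built environment for them. It is not the benchmark's evaluation harness; the function names, the example module names, and the idea of checking against exact pinned versions are all assumptions made for illustration.

```python
import importlib.util
from importlib import metadata


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be located.

    Hypothetical helper: a non-empty result is a symptom of
    incomplete dependency resolution.
    """
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


def find_version_mismatches(pinned):
    """Compare expected distribution versions with the installed ones.

    `pinned` maps a distribution name to the exact version string the
    repository expects (an assumed input format). Returns a dict
    {name: (expected, installed_or_None)} containing only mismatches,
    which is one way brittle version coupling shows up.
    """
    mismatches = {}
    for dist, expected in pinned.items():
        try:
            installed = metadata.version(dist)
        except metadata.PackageNotFoundError:
            installed = None  # distribution is not installed at all
        if installed != expected:
            mismatches[dist] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Placeholder names; a real check would read them from the repository.
    print(find_missing_modules(["json", "no_such_module_xyz_12345"]))
    print(find_version_mismatches({"no-such-dist-xyz-12345": "1.0"}))
```

A static probe like this only catches the simplest cases. A package can be installed at the expected version and still fail at runtime, for example when a compiled extension does not match the available hardware, which is why the benchmark judges environments by whether the code actually executes.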
Taxonomy
Topics: Scientific Computing and Data Management · Model-Driven Software Engineering Techniques · Software Testing and Debugging Techniques
