AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents
Bhanu Prakash Vangala, Ali Adibifar, Ashish Gehani, Tanu Malik

TL;DR
This empirical study examines whether LLM-generated code is reproducible. Only 68.3% of projects run successfully in a clean environment, and the dependencies needed at runtime far exceed those the model declares.
Contribution
The paper introduces a three-layer dependency framework (claimed, working, and runtime dependencies) and provides the first large-scale empirical analysis of reproducibility issues in LLM-based coding agents.
Findings
Only 68.3% of projects execute successfully out-of-the-box
Reproducibility varies substantially across programming languages (Python 89.2%, Java 44.0%)
Dependencies expand 13.5 times on average from declared to runtime
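As a rough illustration of how per-language figures like those above could be tallied, the sketch below aggregates execution outcomes into success percentages. The function name `success_rates` and the toy outcomes are hypothetical; they are not the study's tooling or data.

```python
from collections import defaultdict


def success_rates(results):
    """Map each language to the percentage of its projects that ran.

    `results` is an iterable of (language, succeeded) pairs.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for language, succeeded in results:
        totals[language] += 1
        if succeeded:
            passed[language] += 1
    return {lang: 100.0 * passed[lang] / totals[lang] for lang in totals}


# Hypothetical toy outcomes, NOT data from the study.
toy = [("Python", True), ("Python", True), ("Java", True), ("Java", False)]
rates = success_rates(toy)
```

With the toy input, both Python projects run and one of the two Java projects runs, so the rates are 100.0 and 50.0.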
Abstract
The rise of Large Language Models (LLMs) as coding agents promises to accelerate software development, but their impact on generated code reproducibility remains largely unexplored. This paper presents an empirical study investigating whether LLM-generated code can be executed successfully in a clean environment with only OS packages and using only the dependencies that the model specifies. We evaluate three state-of-the-art LLM coding agents (Claude Code, OpenAI Codex, and Gemini) across 300 projects generated from 100 standardized prompts in Python, JavaScript, and Java. We introduce a three-layer dependency framework (distinguishing between claimed, working, and runtime dependencies) to quantify execution reproducibility. Our results show that only 68.3% of projects execute out-of-the-box, with substantial variation across languages (Python 89.2%, Java 44.0%). We also find a 13.5…
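The three-layer framework can be pictured as three sets of package names per project. The sketch below is one possible reading, not the authors' implementation: it assumes "claimed" means what the model declared, "working" means what had to be installed for the code to run, and "runtime" means everything present at execution, including transitive packages. The class name, methods, and package lists are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyLayers:
    claimed: frozenset  # declared by the model (assumed meaning)
    working: frozenset  # needed to get the code to run (assumed meaning)
    runtime: frozenset  # present at execution, incl. transitive (assumed meaning)

    def undeclared(self):
        """Packages that were needed but never declared."""
        return self.working - self.claimed

    def expansion(self):
        """Ratio of runtime packages to declared packages."""
        if not self.claimed:
            raise ValueError("expansion is undefined with no claimed dependencies")
        return len(self.runtime) / len(self.claimed)


# Hypothetical toy project, NOT data from the study.
layers = DependencyLayers(
    claimed=frozenset({"requests"}),
    working=frozenset({"requests", "pyyaml"}),
    runtime=frozenset(
        {"requests", "pyyaml", "urllib3", "idna", "certifi", "charset-normalizer"}
    ),
)
missing = layers.undeclared()
ratio = layers.expansion()
```

Here one declared package grows to six at runtime, an expansion of 6.0, and `pyyaml` is the undeclared dependency that would break a clean-environment run.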
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Software System Performance and Reliability
