Can LLMs be Effective Code Contributors? A Study on Open-source Projects
Chun Jie Chong, Muyeed Ahmed, Zhihao (Zephyr) Yao, Iulian Neamtiu

TL;DR
This study evaluates how effectively large language models contribute code to open-source projects. It finds significant shortcomings, with success rates that vary widely across projects and models.
Contribution
The paper introduces a framework for assessing LLMs' suitability for code contributions and provides empirical results on their performance in real open-source projects.
Findings
LLMs' success rate ranged from 0% to 60% across projects.
LLMs often generated syntactically incorrect or unverified code.
LLMs struggled to generate new code and to manage context size.
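To make the per-project success rate concrete, here is a minimal sketch of how such a number could be tallied from per-commit outcomes. It is not the paper's framework: the `Attempt` record, its field names, and the rule that an attempt counts only if it both builds and passes validation are assumptions made for illustration, and the sample data in the usage note is invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """One LLM attempt at a commit (hypothetical record layout)."""
    project: str
    model: str
    compiles: bool    # candidate code built without syntax/compile errors
    tests_pass: bool  # candidate passed the project's validation checks


def is_success(attempt: Attempt) -> bool:
    # Assumption: success requires both a clean build and passing validation.
    return attempt.compiles and attempt.tests_pass


def success_rate_by_project(attempts):
    """Return {project: fraction of attempts that succeeded}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for a in attempts:
        totals[a.project] += 1
        wins[a.project] += is_success(a)  # bool adds as 0 or 1
    return {project: wins[project] / totals[project] for project in totals}
```

For example, with two invented attempts on `proj_a` (one succeeds, one builds but fails validation) and one on `proj_b` (fails to build), `success_rate_by_project` returns 0.5 for `proj_a` and 0.0 for `proj_b`. Grouping by `model` instead of `project` would follow the same pattern.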
Abstract
LLM-generated code is widely used, and the share of committed code produced by LLMs is expected to increase. However, we are not at a point where LLMs can be effective contributors to production code. We present an approach that exposes the shortcomings of LLM generation on such projects, and proposes recommendations; the targets of our study are sizable open-source projects, e.g., FFmpeg and wolfSSL. First, we developed a framework that uses verification and validation to evaluate a given LLM's suitability to fix or add features to an existing project. Second, we apply the framework to 212 commits (bug fixes and small feature improvements) in eight popular open-source projects and three LLMs: GPT-4o, Ministral3, and Qwen3-Coder. The success rate varied from 0% to 60% depending on the project. The LLMs failed in a variety of ways, from generating syntactically incorrect code, to…
