Where Is Self-admitted Code Generated by Large Language Models on GitHub?
Xiao Yu, Lei Liu, Xing Hu, Jin Liu, Xin Xia

TL;DR
This study analyzes real-world GitHub projects to identify where developers admit using LLMs such as ChatGPT and Copilot to generate code, and reports how prevalent that code is, what characterizes it, and how little it is modified afterward.
Contribution
It provides the first large-scale analysis of self-admitted LLM-generated code on GitHub, highlighting usage patterns, project types, and code modification behaviors.
Findings
ChatGPT and Copilot dominate LLM-generated code on GitHub.
Most LLM-generated code appears in small, evolving projects and consists of simple snippets.
Minimal modifications are made to LLM-generated code, often just 4-12%.
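The summary does not say how the modification percentage is measured. As a rough illustration only, one way to quantify how much of a generated snippet changed is to compare the originally committed lines against the current version and report the share of original lines that no longer match. The function name `modified_fraction` and the line-level diff are assumptions made for this sketch, not the authors' method.

```python
import difflib


def modified_fraction(original: str, current: str) -> float:
    """Return the share of original lines that were changed or removed.

    Hypothetical metric for illustration; the study may define
    "modification" differently.
    """
    orig_lines = original.splitlines()
    curr_lines = current.splitlines()
    if not orig_lines:
        return 0.0
    # Line-level diff; autojunk is disabled so repeated lines are not ignored.
    matcher = difflib.SequenceMatcher(a=orig_lines, b=curr_lines, autojunk=False)
    # Sum the sizes of all blocks of lines that survive unchanged.
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - unchanged / len(orig_lines)
```

Under this definition, a ten-line snippet with one rewritten line would score roughly 0.1, which is the order of magnitude the finding above describes.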
Abstract
The increasing use of Large Language Models (LLMs) in software development has garnered significant attention from researchers evaluating the capabilities and limitations of LLMs for code generation. However, much of the research focuses on controlled datasets such as HumanEval, which do not adequately capture the characteristics of LLM-generated code in real-world development scenarios. To address this gap, our study investigates self-admitted code generated by LLMs on GitHub, specifically focusing on instances where developers in projects with over five stars acknowledge, through code comments, the use of LLMs to generate code. Our findings reveal several key insights: (1) ChatGPT and Copilot dominate code generation, with minimal contributions from other LLMs. (2) ChatGPT/Copilot-generated code appears in small/medium-sized projects led by small teams, which are…
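The selection criterion described in the abstract (code comments that acknowledge LLM use, in projects with over five stars) can be sketched as a simple comment scan. This is a minimal sketch under stated assumptions: the keyword pattern, the comment prefixes, and the names `find_self_admitted_lines` and `is_candidate_project` are hypothetical, since the abstract does not list the actual search terms or tooling.

```python
import re

# Hypothetical keyword pattern; the study's real search terms are not given here.
LLM_PATTERN = re.compile(
    r"\b(generated|written|created|produced)\s+(by|with|using)\s+"
    r"(chatgpt|github\s+copilot|copilot)\b",
    re.IGNORECASE,
)

# Assumed comment markers covering a few common languages.
COMMENT_PREFIXES = ("#", "//", "/*", "*", "--")


def find_self_admitted_lines(source: str) -> list[tuple[int, str, str]]:
    """Return (line number, tool, comment text) for full-line comments
    that acknowledge LLM-generated code. Trailing inline comments are
    not handled in this sketch."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith(COMMENT_PREFIXES):
            continue
        match = LLM_PATTERN.search(stripped)
        if match:
            tool = "Copilot" if "copilot" in match.group(3).lower() else "ChatGPT"
            hits.append((lineno, tool, stripped))
    return hits


def is_candidate_project(stars: int, min_stars: int = 5) -> bool:
    """Keep only projects with more than `min_stars` stars ("over five stars")."""
    return stars > min_stars
```

A keyword scan like this is likely to produce false positives (for example, comments that merely mention ChatGPT), so a real pipeline would presumably need manual or additional automated filtering.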
Taxonomy
Topics: Speech and dialogue systems · Model-Driven Software Engineering Techniques · Natural Language Processing Techniques
