Comparing Human and LLM Generated Code: The Jury is Still Out!
Sherlock A. Licorish, Ansh Bajpai, Chetan Arora, Fanyu Wang, Kla Tantithamthavorn

TL;DR
This study compares human- and GPT-4-generated Python code across 72 tasks, evaluating quality, adherence to coding standards, security, complexity, and functional correctness to show where LLMs are strong and where they fall short in software development.
Contribution
It provides a comprehensive benchmark analysis of human versus GPT-4 code, highlighting areas where LLMs excel and where humans outperform them in software engineering tasks.
Findings
Humans adhere better to coding standards.
GPT-4 code passes more test cases.
Humans exhibit more diverse security issues.
Abstract
Much is promised in relation to AI-supported software development. However, there has been limited evaluation effort in the research domain aimed at validating the true utility of such techniques, especially when compared to human coding outputs. We bridge this gap, where a benchmark dataset comprising 72 distinct software engineering tasks is used to compare the effectiveness of large language models (LLMs) and human programmers in producing Python software code. GPT-4 is used as a representative LLM, where for the code generated by humans and this LLM, we evaluate code quality and adherence to Python coding standards, code security and vulnerabilities, code complexity and functional correctness. We use various static analysis benchmarks, including Pylint, Radon, Bandit and test cases. Among the notable outcomes, results show that human-generated code recorded higher ratings for…
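To make the kind of comparison described in the abstract concrete, here is a minimal sketch that scores two candidate solutions to the same task on two of the dimensions named: functional correctness (share of test cases passed) and code complexity. It is not the authors' pipeline. The complexity count is a standard-library approximation standing in for Radon, and the snippets, the `clamp` task, the test cases, and the helper names `approx_complexity` and `pass_rate` are all hypothetical.

```python
import ast

# Node types counted as decision points. This is a simplification for
# illustration and does not reproduce Radon's exact rules.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def approx_complexity(source: str) -> int:
    """Rough cyclomatic complexity: 1 plus the number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' contributes two extra paths.
            score += len(node.values) - 1
    return score


def pass_rate(source: str, func_name: str, cases) -> float:
    """Fraction of (args, expected) cases that the named function passes."""
    namespace = {}
    exec(source, namespace)  # acceptable only for trusted snippets
    func = namespace[func_name]
    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases)


# Two hypothetical solutions to the same hypothetical task.
SNIPPET_A = """
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
"""

SNIPPET_B = """
def clamp(x, lo, hi):
    return max(lo, min(x, hi))
"""

CASES = [((5, 0, 10), 5), ((-1, 0, 10), 0), ((11, 0, 10), 10)]

for label, src in (("A", SNIPPET_A), ("B", SNIPPET_B)):
    print(label, approx_complexity(src), pass_rate(src, "clamp", CASES))
```

Both snippets pass every case, yet they differ in branch count, which is why correctness and complexity are worth reporting as separate measures.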
Taxonomy
Topics: Artificial Intelligence in Law · Legal Education and Practice Innovations · Law, AI, and Intellectual Property
