LiCoEval: Evaluating LLMs on License Compliance in Code Generation
Weiwei Xu, Kai Gao, Hao He, Minghui Zhou

TL;DR
This paper introduces LiCoEval, a benchmark for assessing LLMs' ability to provide accurate license information for generated code, revealing significant shortcomings in current models' compliance with open-source licenses.
Contribution
It establishes an empirical standard for "striking similarity" between LLM output and open-source code, builds the LiCoEval benchmark on that standard, and uses it to evaluate the license compliance capabilities of 14 LLMs in code generation.
Findings
Even in top-performing LLMs, 0.88% to 2.01% of generated code is strikingly similar to existing open-source code.
Most LLMs fail to provide correct license information, especially for copyleft licenses.
The study highlights the urgent need to improve LLM license compliance in code generation.
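The two quantities behind these findings (how often output is strikingly similar to open-source code, and how often the model then states the correct license) can be illustrated with a minimal sketch. This is not the paper's implementation: the `Sample` record, the `summarize` helper, and the `difflib`-based similarity check with a 0.9 threshold are all hypothetical stand-ins, and the paper derives its own empirical standard for striking similarity.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


@dataclass
class Sample:
    """Hypothetical record for one benchmark item (not the paper's schema)."""
    generated: str                 # code produced by the model
    reference: str                 # open-source code it is compared against
    reference_license: str         # license of the reference, e.g. "MIT"
    stated_license: Optional[str]  # license the model reported, if any


def is_strikingly_similar(a: str, b: str, threshold: float = 0.9) -> bool:
    # Placeholder criterion: the threshold is an assumption for illustration,
    # not the standard established in the paper.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def summarize(samples: List[Sample]) -> Tuple[float, float]:
    """Return (share of strikingly similar outputs, license accuracy among them)."""
    similar = [s for s in samples if is_strikingly_similar(s.generated, s.reference)]
    correct = [s for s in similar if s.stated_license == s.reference_license]
    similar_rate = len(similar) / len(samples) if samples else 0.0
    license_accuracy = len(correct) / len(similar) if similar else 0.0
    return similar_rate, license_accuracy
```

License accuracy is computed only over the strikingly similar outputs, since those are the cases where a copy relationship, and therefore a license obligation, is presumed.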
Abstract
Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for "striking similarity" that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose LiCoEval, to evaluate the license…
Taxonomy
Topics: Digital Rights Management and Security
