The (ab)use of Open Source Code to Train Large Language Models
Ali Al-Kaswan, Maliheh Izadi

TL;DR
This paper discusses the risks and ethical issues of training large language models on open source code, highlighting memorization concerns and legal dilemmas and proposing actionable solutions.
Contribution
It analyzes the implications of using open source code in LLM training, emphasizing legal, ethical, and security challenges, and offers practical recommendations.
Findings
Models memorize and emit source code verbatim
Using copyleft code raises legal and ethical issues
Four actionable recommendations are proposed
Abstract
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
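To make "emitted in a verbatim manner" concrete, the following is a minimal sketch of one way to flag verbatim overlap between a model's output and a known training sample. It is not the authors' method: the function names, the regex tokenizer, and the default n-gram length of 8 are all assumptions chosen for illustration.

```python
import re


def tokenize(code: str) -> list[str]:
    # Crude lexer (an assumption): identifiers, integer runs, then any
    # other single non-whitespace character. Whitespace is discarded.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    # All contiguous token windows of length n; empty if there are fewer
    # than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(generated: str, training_sample: str, n: int = 8) -> float:
    """Fraction of n-grams in `generated` that also occur in `training_sample`.

    The window size `n` is a hypothetical default, not a value from the paper.
    """
    gen = ngrams(tokenize(generated), n)
    if not gen:
        return 0.0
    ref = ngrams(tokenize(training_sample), n)
    return len(gen & ref) / len(gen)
```

A score near 1.0 suggests the output largely reproduces the sample token for token, while a score near 0.0 suggests little shared structure. Because whitespace is ignored, reformatted copies still match; renamed identifiers do not.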
Taxonomy
Topics: Topic Modeling
