The (ab)use of Open Source Code to Train Large Language Models

Ali Al-Kaswan; Maliheh Izadi

arXiv:2302.13681·cs.SE·March 1, 2023·1 cites

TL;DR

This paper discusses the risks and ethical issues of training large language models on open source code. It highlights memorization concerns and legal dilemmas, and proposes four actionable recommendations.

Contribution

It analyzes the implications of using open source code in LLM training, emphasizing legal, ethical, and security challenges, and offers practical recommendations.

Findings

01. Models memorize and emit source code verbatim

02. Using copyleft code raises legal and ethical issues

03. Four actionable recommendations are proposed

Abstract

In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.

Taxonomy

Topics: Topic Modeling