Traces of Memorisation in Large Language Models for Code

Ali Al-Kaswan; Maliheh Izadi; Arie van Deursen

arXiv:2312.11658 · cs.CR · January 17, 2024 · 1 citation

TL;DR

This paper investigates memorisation in large language models for code, revealing their vulnerability to data extraction attacks and highlighting the need for safeguards to prevent data leakage.

Contribution

It introduces a benchmark for assessing memorisation in code models and compares memorisation rates across models and data types, revealing significant vulnerabilities.

Findings

01 47% of the data identified as extractable was extracted from CodeGen-Mono-16B

02 Memorisation increases with model size

03 Data carriers are memorised more than regular code

Abstract

Large language models have gained significant popularity because of their ability to generate human-like text and potential applications in various fields, such as Software Engineering. Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. The content of these datasets is memorised and can be extracted by attackers with data extraction attacks. In this work, we explore memorisation in large language models for code and compare the rate of memorisation with large language models trained on natural language. We adopt an existing benchmark for natural language and construct a benchmark for code by identifying samples that are vulnerable to attack. We run both benchmarks against a variety of models, and perform a data extraction attack. We find that large language models for code are vulnerable to data extraction attacks,…
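The extraction rate reported above can be illustrated with a minimal sketch. It assumes the attack takes the common prefix-continuation form: the model is prompted with the start of a training sample and counted as having leaked it if the continuation matches the true suffix verbatim. The names `extraction_rate`, `toy_generate`, and `_MEMORISED` are hypothetical and not taken from the paper or its repository, and lengths are measured in characters here for simplicity, where a real evaluation would more likely count tokens.

```python
from typing import Callable, Iterable, Tuple


def extraction_rate(
    generate: Callable[[str, int], str],
    samples: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (prefix, suffix) pairs whose suffix is reproduced verbatim.

    `generate(prefix, max_chars)` is a hypothetical stand-in for a model call.
    """
    total = 0
    extracted = 0
    for prefix, suffix in samples:
        total += 1
        continuation = generate(prefix, len(suffix))
        # Exact match on the first len(suffix) characters counts as extracted.
        if continuation[: len(suffix)] == suffix:
            extracted += 1
    return extracted / total if total else 0.0


# Toy stand-in for a model: it has "memorised" exactly one training string.
_MEMORISED = "def add(a, b):\n    return a + b\n"


def toy_generate(prefix: str, max_chars: int) -> str:
    if _MEMORISED.startswith(prefix):
        return _MEMORISED[len(prefix):][:max_chars]
    return "pass\n"[:max_chars]


samples = [
    ("def add(a, b):\n", "    return a + b\n"),  # memorised by the toy model
    ("def sub(a, b):\n", "    return a - b\n"),  # not memorised
]
rate = extraction_rate(toy_generate, samples)
print(rate)
```

Exact matching is the strictest criterion; a looser similarity threshold would count near-verbatim continuations as well and would yield a higher rate on the same samples.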

Code & Models

Repositories

aise-tudelft/llm4code-extraction (PyTorch · Official)

Taxonomy

Topics: Software Engineering Research · Advanced Malware Detection Techniques · Topic Modeling