Quantifying and Analyzing Entity-level Memorization in Large Language Models

Zhenhong Zhou; Jiuyang Xiang; Chaomeng Chen; Sen Su

arXiv:2308.15727 · cs.CL · November 7, 2023 · 1 citation

TL;DR

This paper introduces a fine-grained method to quantify and analyze entity-level memorization in large language models, revealing their strong ability to memorize and reproduce sensitive training data, which raises privacy concerns.

Contribution

It proposes a new entity-level quantification approach and an efficient method for extracting sensitive entities, enabling practical privacy risk assessment in large language models.

Findings

1. LLMs exhibit strong entity-level memorization.

2. Models can reproduce training data even when only part of it has been leaked.

3. LLMs understand associations between entities.

Abstract

Large language models (LLMs) have been shown to memorize their training data, which can be extracted through specifically designed prompts. As datasets continue to grow, the privacy risks arising from memorization have attracted increasing attention. Quantifying language model memorization helps evaluate these risks. However, prior work on quantifying memorization requires access to the precise original data or incurs substantial computational overhead, making it difficult to apply to real-world language models. To this end, we propose a fine-grained, entity-level definition to quantify memorization with conditions and metrics closer to real-world scenarios. We also present an approach for efficiently extracting sensitive entities from autoregressive language models. We conduct extensive experiments based on the proposed, probing…
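To make the idea of an entity-level measure concrete, here is a minimal sketch. It is an illustrative assumption, not the definition or metric from the paper: the function name `entity_recall`, the case-insensitive substring match, and the sample entities are all hypothetical. It scores a model's output by the fraction of expected sensitive entities that it reproduces verbatim.

```python
def entity_recall(expected_entities, generated_text):
    """Fraction of expected entities found verbatim in generated_text.

    Hypothetical helper for illustration only; matching is a simple
    case-insensitive substring check, which is an assumption and not
    the paper's metric.
    """
    if not expected_entities:
        return 0.0
    text = generated_text.lower()
    hits = [e for e in expected_entities if e.lower() in text]
    return len(hits) / len(expected_entities)


# Made-up example: three entities expected, two reproduced in the output.
entities = ["Alice Smith", "alice@example.com", "555-0100"]
output = "Contact Alice Smith at alice@example.com for details."
score = entity_recall(entities, output)
```

A score near 1 would mean the output reproduced most of the expected entities. Scoring per entity rather than per verbatim passage is what lets a partially matching output still register as a leak.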

Taxonomy

Topics: Privacy-Preserving Technologies in Data