What Should LLMs Forget? Quantifying Personal Data in LLMs for Right-to-Be-Forgotten Requests

Dimitri Staufer

arXiv:2507.11128·cs.CL·July 16, 2025

Open Access

TL;DR

This paper introduces a dataset and metric to identify and quantify personal data memorized by LLMs, facilitating compliance with GDPR's Right to Be Forgotten at the individual level.

Contribution

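The paper presents WikiMem, a dataset of over 5,000 natural language canaries covering 243 human-related Wikidata properties, and a model-agnostic metric for quantifying human-fact associations in LLMs, supporting targeted unlearning and privacy compliance.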

Findings

01. Memorization correlates with web presence and model size.

02. The metric effectively ranks factual associations in LLMs.

03. Evaluation of 200 individuals across 15 LLMs demonstrates practical applicability.

Abstract

Large Language Models (LLMs) can memorize and reveal personal information, raising concerns regarding compliance with the EU's GDPR, particularly the Right to Be Forgotten (RTBF). Existing machine unlearning methods assume the data to forget is already known but do not address how to identify which individual-fact associations are stored in the model. Privacy auditing techniques typically operate at the population level or target a small set of identifiers, limiting applicability to individual-level data inquiries. We introduce WikiMem, a dataset of over 5,000 natural language canaries covering 243 human-related properties from Wikidata, and a model-agnostic metric to quantify human-fact associations in LLMs. Our approach ranks ground-truth values against counterfactuals using calibrated negative log-likelihood across paraphrased prompts. We evaluate 200 individuals across 15 LLMs…
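
The ranking idea described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the scoring function `toy_nll` stands in for a real model, the calibration (subtracting the value's NLL under a neutral prompt) is one common choice that the abstract does not specify, and the person "Jane Doe", the prompts, and all numbers are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical interface: returns the negative log-likelihood (NLL) of
# `value` as a continuation of `prompt`. A real setup would query an LLM.
NllFn = Callable[[str, str], float]


def calibrated_nll(nll: NllFn, prompt: str, value: str, neutral_prompt: str) -> float:
    """NLL of `value` given `prompt`, minus its NLL given a neutral prompt.

    Assumed calibration: it discounts values that are likely under any prompt.
    """
    return nll(prompt, value) - nll(neutral_prompt, value)


def rank_ground_truth(
    nll: NllFn,
    paraphrases: Sequence[str],
    truth: str,
    counterfactuals: Sequence[str],
    neutral_prompt: str = "The answer is",
) -> Tuple[int, Dict[str, float]]:
    """Average calibrated NLL over paraphrases and rank the true value.

    Returns (rank, scores), where rank 1 means the ground truth has the
    lowest calibrated NLL among all candidates.
    """
    candidates = [truth, *counterfactuals]
    scores = {
        v: mean(calibrated_nll(nll, p, v, neutral_prompt) for p in paraphrases)
        for v in candidates
    }
    ordered = sorted(candidates, key=lambda v: scores[v])
    return ordered.index(truth) + 1, scores


def toy_nll(prompt: str, value: str) -> float:
    """Toy stand-in for a model: fixed prior NLLs plus one memorized fact."""
    prior = {"Paris": 0.5, "Berlin": 3.0, "Rome": 3.5}[value]
    if "Jane Doe" in prompt and value == "Berlin":
        return prior - 2.0  # the "model" has memorized this association
    return prior


rank, scores = rank_ground_truth(
    toy_nll,
    paraphrases=["Jane Doe was born in", "The birthplace of Jane Doe is"],
    truth="Berlin",
    counterfactuals=["Paris", "Rome"],
)
print(rank, scores)
```

In this toy setup the raw NLL would favor "Paris" because it is a frequent value under any prompt, while the calibrated score places the true value first. That is the motivation for calibrating before ranking.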

Taxonomy

Topics: Digital Rights Management and Security · Artificial Intelligence in Law

Methods: Counterfactual Explanations · Sparse Evolutionary Training