Detecting Memorization in Large Language Models

Eduardo Slonski

arXiv:2412.01014 · cs.LG · December 3, 2024

Open Access

TL;DR

This paper presents a precise method for detecting and controlling memorization in large language models by analyzing neuron activations, improving interpretability and evaluation accuracy.

Contribution

It introduces an activation-based detection technique that accurately identifies memorized tokens and enables suppression without harming model performance.

Findings

01. Near-perfect accuracy in memorization detection

02. Effective suppression of memorization through activation intervention

03. Versatile application to repetition detection

Abstract

Large language models (LLMs) have achieved impressive results in natural language processing but are prone to memorizing portions of their training data, which can compromise evaluation metrics, raise privacy concerns, and limit generalization. Traditional methods for detecting memorization rely on output probabilities or loss functions and often lack precision because of confounding factors such as common language patterns. In this paper, we introduce an analytical method that precisely detects memorization by examining neuron activations within the LLM. By identifying specific activation patterns that differentiate memorized from non-memorized tokens, we train classification probes that achieve near-perfect accuracy. The approach can also be applied to other mechanisms, such as repetition, as demonstrated in this study, highlighting its versatility. Intervening on these activations…
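To make the idea of a classification probe over activations concrete, here is a minimal sketch in Python. It is not the paper's implementation: the activations are synthetic, the assumption that a handful of neurons shift for memorized tokens is made up for illustration, and the names `make_synthetic_activations`, `train_probe`, and `probe_accuracy` are hypothetical. In practice the activation matrix would come from a chosen layer of a real model, with labels marking which tokens are known to be memorized.

```python
import numpy as np


def make_synthetic_activations(n_tokens=2000, n_neurons=64, seed=0):
    """Stand-in for per-token activations; NOT real model data.

    Assumption for illustration only: the first four neurons are shifted
    upward for tokens labeled as memorized (label 1).
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_tokens)
    acts = rng.normal(size=(n_tokens, n_neurons))
    acts[:, :4] += 2.0 * labels[:, None]
    return acts, labels


def train_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n_tokens, n_neurons = acts.shape
    w = np.zeros(n_neurons)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n_tokens
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(w, b, acts, labels):
    """Fraction of tokens whose predicted label matches the true label."""
    preds = (acts @ w + b) > 0
    return float((preds == labels).mean())


acts, labels = make_synthetic_activations()
split = 1500  # first 1500 tokens for training, the rest held out
w, b = train_probe(acts[:split], labels[:split])
acc = probe_accuracy(w, b, acts[split:], labels[split:])
print(f"held-out probe accuracy: {acc:.3f}")
```

A linear probe is used here because it is the simplest classifier that can read a label off an activation vector; whether the paper's probes are linear is not stated in the text above.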

Taxonomy

Topics: Topic Modeling