Detecting Memorization in Large Language Models
Eduardo Slonski

TL;DR
This paper presents a precise method for detecting and controlling memorization in large language models by analyzing neuron activations, improving interpretability and evaluation accuracy.
Contribution
It introduces an activation-based detection technique that accurately identifies memorized tokens and enables suppression without harming model performance.
Findings
Near-perfect accuracy in memorization detection
Effective suppression of memorization through activation intervention
Versatile application to repetition detection
Abstract
Large language models (LLMs) have achieved impressive results in natural language processing but are prone to memorizing portions of their training data, which can compromise evaluation metrics, raise privacy concerns, and limit generalization. Traditional methods for detecting memorization rely on output probabilities or loss functions and often lack precision because of confounding factors such as common language patterns. In this paper, we introduce an analytical method that precisely detects memorization by examining neuron activations within the LLM. By identifying specific activation patterns that differentiate memorized from non-memorized tokens, we train classification probes that achieve near-perfect accuracy. The approach can also be applied to other mechanisms, such as repetition, as demonstrated in this study, highlighting its versatility. Intervening on these activations…
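The idea of a classification probe over activations can be illustrated with a minimal sketch. This is not the paper's implementation: the activations are synthetic, the "memorization neuron" at index 3 is a hypothetical stand-in, and the function names (make_synthetic_activations, train_probe, probe_accuracy) are invented for illustration. It only shows the general shape of the technique: collect one activation vector per token, label each token as memorized or not, and fit a logistic-regression probe on top.

```python
import numpy as np

def make_synthetic_activations(n_tokens=400, d_model=16, seed=0):
    """Synthetic stand-in for per-token activations (not real LLM data).

    Assumption for illustration: one hypothetical neuron (index 3)
    fires more strongly on tokens labeled as memorized.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_tokens)   # 1 = memorized, 0 = not
    acts = rng.normal(size=(n_tokens, d_model))  # background noise
    acts[:, 3] += 6.0 * labels                   # shift the hypothetical neuron
    return acts, labels

def train_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels                    # gradient of the log loss w.r.t. logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of tokens whose predicted label matches the true label."""
    preds = (acts @ w + b) > 0
    return float((preds == labels).mean())

# Train on the first 300 tokens, evaluate on the remaining 100.
acts, labels = make_synthetic_activations()
w, b = train_probe(acts[:300], labels[:300])
acc = probe_accuracy(w, b, acts[300:], labels[300:])
print(f"held-out probe accuracy: {acc:.2f}")
```

With real models, the activation matrix would come from a chosen layer of the LLM and the labels from a known set of memorized and non-memorized samples; both choices are outside what this sketch covers.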
