Understanding Verbatim Memorization in LLMs Through Circuit Discovery
Ilya Lasy, Peter Knees, Stefan Woltran

TL;DR
This paper investigates how large language models memorize data by analyzing transformer circuits, identifying specific mechanisms responsible for initiating and maintaining memorization, and examining how these processes transfer across contexts.
Contribution
It introduces a circuit-based interpretability approach to understand memorization in LLMs, revealing distinct circuits for initiation and maintenance of memorization.
Findings
Circuits that initiate memorization can also sustain it.
Memorization prevention mechanisms transfer across domains.
Memorization induction is highly context-dependent.
Abstract
The underlying mechanisms of memorization in LLMs -- the verbatim reproduction of training data -- remain poorly understood. What exact part of the network decides to retrieve a token that we would consider the start of a memorized sequence? How exactly does the model's behaviour differ when producing a memorized sentence versus a non-memorized one? In this work we approach these questions from a mechanistic interpretability standpoint by utilizing transformer circuits -- the minimal computational subgraphs that perform specific functions within the model. Through carefully constructed contrastive datasets, we identify points where model generation diverges from memorized content and isolate the specific circuits responsible for two distinct aspects of memorization. We find that circuits that initiate memorization can also maintain it once started, while circuits that only maintain memorization cannot…
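To make the idea of a divergence point concrete, the following is a minimal sketch, not the authors' implementation. It assumes the model's continuation and the memorized reference are both available as lists of token IDs, and it reports the first position where they disagree. The function name first_divergence and the token IDs in the example are hypothetical.

```python
from typing import Optional, Sequence


def first_divergence(generated: Sequence[int], reference: Sequence[int]) -> Optional[int]:
    """Return the first index at which `generated` differs from `reference`.

    Only the overlapping prefix of the two sequences is compared.
    Returns None if no mismatch is found within that overlap.
    """
    for i, (g, r) in enumerate(zip(generated, reference)):
        if g != r:
            return i
    return None


# Hypothetical token IDs, for illustration only.
reference_ids = [101, 7, 42, 13, 99]
generated_ids = [101, 7, 42, 58, 99]

divergence = first_divergence(generated_ids, reference_ids)
print(divergence)
```

In a contrastive setup of this kind, a position found this way would mark where a memorized continuation and a non-memorized one part ways. How the paper actually builds its datasets is not specified in the excerpt above.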
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
