Guess or Recall? Training CNNs to Classify and Localize Memorization in LLMs
Jérémie Dentan, Davide Buscaldi, Sonia Vanier

TL;DR
This paper trains CNNs on an LLM's attention weights to classify and localize different forms of verbatim memorization. It finds that the existing taxonomy aligns poorly with those weights and proposes a new one with three categories: memorized samples that are guessed, memorized samples that are recalled, and non-memorized samples.
Contribution
The paper proposes a taxonomy of memorization built to align with the LLM's attention weights, together with a visualization technique that distinguishes and localizes memorization mechanisms within the model.
Findings
Existing taxonomy poorly reflects attention mechanisms
Most memorized samples are guessed, not recalled
Few-shot memorization is not a distinct attention process
Abstract
Verbatim memorization in Large Language Models (LLMs) is a multifaceted phenomenon involving distinct underlying mechanisms. We introduce a novel method to analyze the different forms of memorization described by the existing taxonomy. Specifically, we train Convolutional Neural Networks (CNNs) on the attention weights of the LLM and evaluate the alignment between this taxonomy and the attention weights involved in decoding. We find that the existing taxonomy performs poorly and fails to reflect distinct mechanisms within the attention blocks. We propose a new taxonomy that maximizes alignment with the attention weights, consisting of three categories: memorized samples that are guessed using language modeling abilities, memorized samples that are recalled due to high duplication in the training set, and non-memorized samples. Our results reveal that few-shot verbatim memorization…
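To make the idea of training a CNN on attention weights concrete, here is a minimal, dependency-free sketch of the forward pass only. It is not the authors' architecture. The channel layout (one channel per attention head), the 3x3 kernels, the global max pooling, the random untrained weights, and all function names are assumptions made for illustration. Only the three class labels come from the abstract.

```python
import math
import random

# The three categories named in the abstract.
LABELS = ["guessed", "recalled", "non-memorized"]

def causal_softmax_attention(scores):
    """Row-wise softmax under a causal mask: token i attends to tokens 0..i."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = scores[i][: i + 1]
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights

def conv2d_valid(channels, kernels, bias=0.0):
    """'Valid' 2D cross-correlation, summed over input channels (one output map)."""
    k = len(kernels[0])
    out_size = len(channels[0]) - k + 1
    out = [[bias] * out_size for _ in range(out_size)]
    for ch, ker in zip(channels, kernels):
        for i in range(out_size):
            for j in range(out_size):
                out[i][j] += sum(
                    ch[i + di][j + dj] * ker[di][dj]
                    for di in range(k)
                    for dj in range(k)
                )
    return out

def relu_global_max_pool(feature_map):
    """ReLU followed by a global max: one scalar per feature map."""
    return max(max(0.0, v) for row in feature_map for v in row)

def classify(channels, filters, readout):
    """Conv -> ReLU -> global max pool -> linear readout over the three classes."""
    features = [relu_global_max_pool(conv2d_valid(channels, f)) for f in filters]
    logits = [sum(w * x for w, x in zip(ws, features)) for ws in readout]
    return logits.index(max(logits)), logits

# Toy stand-in data: random scores instead of a real LLM's attention.
rng = random.Random(0)
SEQ_LEN, NUM_HEADS, KERNEL = 8, 4, 3  # hypothetical sizes

def random_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Assumption: each attention head contributes one input channel.
attention_maps = [
    causal_softmax_attention(random_matrix(SEQ_LEN, SEQ_LEN))
    for _ in range(NUM_HEADS)
]
# Two untrained convolutional features, each with one kernel per channel.
filters = [[random_matrix(KERNEL, KERNEL) for _ in range(NUM_HEADS)] for _ in range(2)]
readout = random_matrix(len(LABELS), 2)

label_index, logits = classify(attention_maps, filters, readout)
print(LABELS[label_index], logits)
```

Because the weights are random, the printed label carries no meaning. In practice the filters and readout would be fitted on labeled samples, and the fit quality would then indicate how well a given taxonomy aligns with the attention weights.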
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
