Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon
USVSN Sai Prashanth, Alvin Deng, Kyle O'Brien, Jyothir S V, Mohammad Aflah Khan, Jaydeep Borkar, Christopher A. Choquette-Choo, Jacob Ray Fuehne, Stella Biderman, Tracy Ke, Katherine Lee, Naomi Saphra

TL;DR
This paper proposes a nuanced taxonomy of memorization in language models, categorizing it into recitation, reconstruction, and recollection, and demonstrates how different factors influence memorization likelihood across these categories.
Contribution
It introduces a detailed taxonomy of memorization in language models and develops a predictive model to analyze how various factors affect memorization.
Findings
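Different factors influence the likelihood of memorization differently depending on the taxonomic category.
The taxonomy is useful in practice: it can be used to construct a predictive model for memorization.
Analyzing dependencies and inspecting the weights of the predictive model reveals distinct dependencies for each memorization category.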
Abstract
Memorization in language models is typically treated as a homogenous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.
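The three categories in the abstract can be read as a simple decision rule applied to a sequence already known to be memorized. The following is a minimal sketch of that reading, not the authors' implementation: the duplicate threshold, the two "predictable" heuristics (repeating and incrementing patterns), and all function names are assumptions chosen for illustration.

```python
from enum import Enum


class Category(Enum):
    RECITATION = "recitation"          # highly duplicated in the corpus
    RECONSTRUCTION = "reconstruction"  # inherently predictable pattern
    RECOLLECTION = "recollection"      # neither of the above


# Assumed cutoff for "highly duplicated"; the abstract does not state a value.
DUPLICATE_THRESHOLD = 5


def is_repeating(tokens):
    """True if the sequence is a shorter unit repeated (the last copy may be partial)."""
    n = len(tokens)
    for period in range(1, n // 2 + 1):
        if all(tokens[i] == tokens[i % period] for i in range(n)):
            return True
    return False


def is_incrementing(tokens):
    """True if all tokens are integers separated by a constant nonzero step."""
    if len(tokens) < 3 or not all(isinstance(t, int) for t in tokens):
        return False
    step = tokens[1] - tokens[0]
    return step != 0 and all(b - a == step for a, b in zip(tokens, tokens[1:]))


def classify(tokens, duplicate_count):
    """Assign a memorized sequence to one taxonomy category (hypothetical rule)."""
    if duplicate_count > DUPLICATE_THRESHOLD:
        return Category.RECITATION
    if is_repeating(tokens) or is_incrementing(tokens):
        return Category.RECONSTRUCTION
    return Category.RECOLLECTION
```

Checking duplication first means a sequence that is both frequent and predictable is counted as recitation; that ordering is also an assumption of this sketch. For example, `classify(["a", "b", "a", "b", "a"], duplicate_count=1)` falls under reconstruction, while a rare, unpatterned sequence falls under recollection.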
