Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

USVSN Sai Prashanth; Alvin Deng; Kyle O'Brien; Jyothir S V; Mohammad Aflah Khan; Jaydeep Borkar; Christopher A. Choquette-Choo; Jacob Ray Fuehne; Stella Biderman; Tracy Ke; Katherine Lee; Naomi Saphra

arXiv:2406.17746 · cs.CL · May 9, 2025


TL;DR

This paper proposes a nuanced taxonomy of memorization in language models, categorizing it into recitation, reconstruction, and recollection, and demonstrates how different factors influence memorization likelihood across these categories.

Contribution

It introduces a detailed taxonomy of memorization in language models and develops a predictive model to analyze how various factors affect memorization.

Findings

1. Different factors influence memorization depending on the category.

2. A predictive model can effectively classify types of memorization.

3. Analysis reveals distinct dependencies for each memorization category.

Abstract

Memorization in language models is typically treated as a homogeneous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.
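The three-way taxonomy in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the text above: the `Sample` type, the `DUPLICATE_THRESHOLD` value, the two predictability heuristics (`is_repeating`, `is_incrementing`, which operate on integer token IDs as a crude proxy for "inherently predictable"), and the order in which the categories are checked.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical cutoff for "highly duplicated"; the abstract gives no number.
DUPLICATE_THRESHOLD = 5


@dataclass
class Sample:
    tokens: Sequence[int]   # token IDs of the memorized sequence (assumed)
    duplicate_count: int    # corpus occurrences, assumed precomputed


def is_repeating(tokens: Sequence[int], max_period: int = 8) -> bool:
    """True if the sequence is a short block repeated end to end."""
    n = len(tokens)
    for period in range(1, min(max_period, n // 2) + 1):
        if all(tokens[i] == tokens[i % period] for i in range(n)):
            return True
    return False


def is_incrementing(tokens: Sequence[int]) -> bool:
    """True if consecutive tokens differ by one constant nonzero step."""
    if len(tokens) < 3:
        return False
    step = tokens[1] - tokens[0]
    return step != 0 and all(b - a == step for a, b in zip(tokens, tokens[1:]))


def categorize(sample: Sample) -> str:
    """Assign one taxonomy label; the check order is an assumption."""
    if sample.duplicate_count > DUPLICATE_THRESHOLD:
        return "recitation"        # highly duplicated in the corpus
    if is_repeating(sample.tokens) or is_incrementing(sample.tokens):
        return "reconstruction"    # predictable from a simple template
    return "recollection"          # neither duplicated nor predictable
```

For example, `categorize(Sample([1, 2, 1, 2, 1, 2], duplicate_count=1))` falls through the duplication check and is labeled by the repetition heuristic. Once samples carry a label like this, a separate predictive model could be fit per category so that its weights can be compared across categories, which is the kind of analysis the abstract describes.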

Code & Models

Repositories

eleutherai/semantic-memorization (PyTorch, official)

Videos

Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon · SlidesLive

Taxonomy

Topics: EFL/ESL Teaching and Learning

Methods: Sparse Evolutionary Training