The Mosaic Memory of Large Language Models

Igor Shilov; Matthieu Meeus; Yves-Alexandre de Montjoye

arXiv:2405.15523·cs.CL·May 16, 2025·1 cites

TL;DR

This paper reveals that large language models memorize information not just through exact data repetition but by assembling similar, fuzzy sequences, a process called mosaic memory, which has significant implications for privacy and model evaluation.

Contribution

The study introduces the concept of mosaic memory in LLMs, demonstrating that they memorize through assembling similar sequences rather than just exact duplicates, challenging prior assumptions.

Findings

01

LLMs memorize by assembling information from similar sequences, not only from exact repeats.

02

A fuzzy duplicate can contribute to memorization as much as 0.8 of an exact duplicate.

03

Memorization is mostly syntactic, not semantic.

Abstract

As Large Language Models (LLMs) become widely adopted, understanding how they learn from, and memorize, training data becomes crucial. Memorization in LLMs is widely assumed to occur only as a result of sequences being repeated in the training data. Instead, we show that LLMs memorize by assembling information from similar sequences, a phenomenon we call mosaic memory. We show major LLMs to exhibit mosaic memory, with fuzzy duplicates contributing to memorization as much as 0.8 of an exact duplicate and even heavily modified sequences contributing substantially to memorization. Despite models displaying reasoning capabilities, we somewhat surprisingly show memorization to be predominantly syntactic rather than semantic. We finally show fuzzy duplicates to be ubiquitous in real-world data, untouched by deduplication techniques. Taken together, our results challenge widely held beliefs and…
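
To make the term "fuzzy duplicate" concrete, here is a minimal Python sketch. It assumes a fuzzy duplicate is built by swapping a few tokens of a reference sequence for other tokens, which is an illustrative choice and not necessarily the construction used in the paper. The helper names (`make_fuzzy_duplicate`, `position_overlap`, `exact_dedup`) and the toy vocabulary are hypothetical. The sketch also shows why hash-based exact deduplication would leave such a sequence untouched.

```python
import hashlib
import random


def make_fuzzy_duplicate(tokens, n_replace, vocab, seed=0):
    """Return a copy of `tokens` with `n_replace` positions swapped for other vocab tokens.

    Assumption: token replacement is only one possible way to build a fuzzy duplicate.
    """
    rng = random.Random(seed)
    out = list(tokens)
    for pos in rng.sample(range(len(tokens)), n_replace):
        # Pick a replacement that differs from the original token at this position.
        out[pos] = rng.choice([t for t in vocab if t != tokens[pos]])
    return out


def position_overlap(a, b):
    """Fraction of positions at which two equal-length token lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def exact_dedup(sequences):
    """Keep the first occurrence of each exactly identical sequence (hash-based)."""
    seen, kept = set(), []
    for seq in sequences:
        digest = hashlib.sha256(" ".join(seq).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(seq)
    return kept


reference = "the quick brown fox jumps over the lazy dog near the river bank".split()
vocab = sorted(set(reference)) + ["cat", "hill", "runs", "small"]  # toy vocabulary
fuzzy = make_fuzzy_duplicate(reference, n_replace=2, vocab=vocab)

corpus = [reference, list(reference), fuzzy]
deduped = exact_dedup(corpus)

print(f"overlap with reference: {position_overlap(reference, fuzzy):.2f}")
print(f"sequences before/after exact dedup: {len(corpus)}/{len(deduped)}")
```

In this toy corpus the exact copy is removed, while the fuzzy duplicate survives even though it agrees with the reference at most positions. Per the abstract, sequences of this kind can still contribute substantially to memorization.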

Code & Models

Repositories

computationalprivacy/mosaic_memory
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling