The Mosaic Memory of Large Language Models
Igor Shilov, Matthieu Meeus, Yves-Alexandre de Montjoye

TL;DR
This paper shows that large language models memorize not only through exact repetition of training data but also by assembling information from similar, fuzzy sequences, a phenomenon the authors call mosaic memory, which has significant implications for privacy and model evaluation.
Contribution
The study introduces the concept of mosaic memory in LLMs, demonstrating that they memorize through assembling similar sequences rather than just exact duplicates, challenging prior assumptions.
Findings
LLMs memorize by assembling information from similar sequences, not only from exact repetitions.
Fuzzy duplicates contribute substantially to memorization, as much as 0.8 of an exact duplicate.
Memorization is mostly syntactic, not semantic.
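To make the notion of a fuzzy duplicate concrete, here is a minimal Python sketch. It is not the authors' method: the whitespace tokenizer, the token-level Hamming distance, and the helper names (`tokenize`, `hamming_distance`, `exact_dedup`, `is_fuzzy_duplicate`) are all assumptions chosen for illustration. It only shows why exact-match deduplication leaves a sequence with a few replaced tokens in the data.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumption: whitespace tokenization stands in for a real model tokenizer.
    return text.split()


def hamming_distance(a: List[str], b: List[str]) -> int:
    # Number of positions at which two equal-length token lists differ.
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def exact_dedup(texts: List[str]) -> List[str]:
    # Keep the first occurrence of each exact token sequence, preserving order.
    seen = set()
    kept = []
    for text in texts:
        key = tuple(tokenize(text))
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def is_fuzzy_duplicate(a: str, b: str, max_replaced: int) -> bool:
    # Hypothetical criterion: same length, at least one and at most
    # `max_replaced` tokens replaced. Exact copies are not counted as fuzzy.
    ta, tb = tokenize(a), tokenize(b)
    if len(ta) != len(tb):
        return False
    return 0 < hamming_distance(ta, tb) <= max_replaced


reference = "the quick brown fox jumps over the lazy dog"
fuzzy = "the quick brown cat jumps over the sleepy dog"  # two tokens replaced

corpus = [reference, reference, fuzzy]
deduplicated = exact_dedup(corpus)
```

In this toy corpus, `exact_dedup` drops the repeated copy of `reference` but keeps `fuzzy`, while `is_fuzzy_duplicate(reference, fuzzy, max_replaced=2)` flags the pair. A real pipeline would need a scalable near-duplicate search rather than pairwise comparison.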
Abstract
As Large Language Models (LLMs) become widely adopted, understanding how they learn from, and memorize, training data becomes crucial. Memorization in LLMs is widely assumed to only occur as a result of sequences being repeated in the training data. Instead, we show that LLMs memorize by assembling information from similar sequences, a phenomenon we call mosaic memory. We show major LLMs to exhibit mosaic memory, with fuzzy duplicates contributing to memorization as much as 0.8 of an exact duplicate and even heavily modified sequences contributing substantially to memorization. Despite models displaying reasoning capabilities, we somewhat surprisingly show memorization to be predominantly syntactic rather than semantic. We finally show fuzzy duplicates to be ubiquitous in real-world data, untouched by deduplication techniques. Taken together, our results challenge widely held beliefs and…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
