Positional Fragility in LLMs: How Offset Effects Reshape Our Understanding of Memorization Risks
Yixuan Xu, Antoni-Joan Solergibert i Llaquet, Antoine Bosselut, Imanol Schlag

TL;DR
This paper investigates how positional offset affects memorization in large language models, revealing that reliance on early tokens causes fragility and that shifting data within the context window can reduce memorization risks.
Contribution
It introduces the concept of positional fragility, demonstrating how token position influences memorization and proposing a method to mitigate risks by adjusting data placement within the context window.
Findings
Verbatim memorization peaks with short prefixes at the start of the context window.
Memorization decreases as prefix length increases, contrary to prior assumptions.
Shifting sensitive data deeper into the context window reduces memorization and degeneration.
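The findings above rest on probing a model with prefixes of different lengths taken at different offsets, then checking how much of the true continuation comes back verbatim. Below is a minimal sketch of such a probe, not the authors' evaluation code. All names (`verbatim_match_length`, `probe`, `sweep`, `make_toy_generator`) are hypothetical, the scoring rule (length of the exact-match run) is an assumed simplification, and the toy generator is a stand-in for a real model that only continues prefixes matching the very start of its memorized sequence.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed interface: a generator takes prefix token ids and a token budget,
# and returns the generated token ids. A real model would be wrapped to fit this.
Generator = Callable[[Sequence[int], int], List[int]]


def verbatim_match_length(generated: Sequence[int], reference: Sequence[int]) -> int:
    """Count how many leading tokens of `generated` match `reference` exactly."""
    n = 0
    for g, r in zip(generated, reference):
        if g != r:
            break
        n += 1
    return n


def probe(generate: Generator, doc_tokens: Sequence[int],
          offset: int, prefix_len: int, suffix_len: int) -> int:
    """Prompt with doc_tokens[offset : offset + prefix_len] and score the continuation."""
    prefix = doc_tokens[offset:offset + prefix_len]
    start = offset + prefix_len
    reference = doc_tokens[start:start + suffix_len]
    generated = generate(prefix, len(reference))
    return verbatim_match_length(generated, reference)


def sweep(generate: Generator, doc_tokens: Sequence[int],
          offsets: Sequence[int], prefix_lens: Sequence[int],
          suffix_len: int) -> Dict[Tuple[int, int], int]:
    """Run the probe over a grid of (offset, prefix_len) settings."""
    return {
        (o, p): probe(generate, doc_tokens, o, p, suffix_len)
        for o in offsets
        for p in prefix_lens
    }


def make_toy_generator(memorized: Sequence[int]) -> Generator:
    """Toy stand-in for a model (an assumption for illustration only).

    It continues a prefix only when the prefix equals the START of `memorized`;
    otherwise it emits a filler token (-1) that never matches the document.
    """
    def generate(prefix: Sequence[int], max_new_tokens: int) -> List[int]:
        k = len(prefix)
        if list(memorized[:k]) == list(prefix):
            return list(memorized[k:k + max_new_tokens])
        return [-1] * max_new_tokens
    return generate


if __name__ == "__main__":
    doc = list(range(100))  # placeholder token ids, not real data
    toy = make_toy_generator(doc)
    for setting, score in sweep(toy, doc, offsets=[0, 8], prefix_lens=[5, 20], suffix_len=10).items():
        print(setting, score)
```

The toy generator only exercises the harness; it says nothing about how a trained model behaves. With a real model, the same grid of offsets and prefix lengths would give the kind of comparison the findings describe.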
Abstract
Large language models are known to memorize parts of their training data, posing a risk of copyright violations. To systematically examine this risk, we pretrain language models (1B/3B/8B) from scratch on 83B tokens, mixing web-scale data with public domain books used to simulate copyrighted content at controlled frequencies and at lengths at least ten times longer than in prior work. We thereby identify the offset effect, a phenomenon characterized by two key findings: (1) verbatim memorization is most strongly triggered by short prefixes drawn from the beginning of the context window, with memorization decreasing counterintuitively as prefix length increases; and (2) verbatim recall declines sharply when the prefix begins at an offset from the initial tokens of the context window. We attribute this to positional fragility: models rely disproportionately on the earliest tokens in their context window…
Taxonomy
Topics: Topic Modeling · Authorship Attribution and Profiling · Text Readability and Simplification
