Beyond Frequency: The Role of Redundancy in Large Language Model Memorization
Jie Zhang, Qinghua Zhao, Chi-ho Lin, Zhongfeng Kang, Lei Li

TL;DR
This paper investigates how redundancy in training data influences memorization in large language models. It finds that low-redundancy samples are both more prone to memorization and more fragile under perturbation, a result that can inform privacy and fairness improvements.
Contribution
It uncovers the distinct response patterns of memorized versus non-memorized samples and links redundancy levels to memorization and vulnerability, offering new insights for data preprocessing.
Findings
79% of memorized samples are low-redundancy
Low-redundancy samples are twice as vulnerable as high-redundancy ones
Memorized samples drop significantly under perturbation; non-memorized samples do not
Abstract
Memorization in large language models poses critical risks for privacy and fairness as these systems scale to billions of parameters. While previous studies established correlations between memorization and factors like token frequency and repetition patterns, we reveal distinct response patterns: frequency increases minimally impact memorized samples (e.g., 0.09) while substantially affecting non-memorized samples (e.g., 0.25), with consistency observed across model scales. Through counterfactual analysis, perturbing sample prefixes and quantifying perturbation strength through token positional changes, we demonstrate that redundancy correlates with memorization patterns. Our findings establish that about 79% of memorized samples are low-redundancy, that these low-redundancy samples exhibit 2-fold higher vulnerability than high-redundancy ones, and that consequently memorized samples drop by…
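
The abstract describes perturbing sample prefixes and measuring perturbation strength through token positional changes, but the text available here does not give the exact procedure. The following is a minimal sketch of one plausible reading, not the authors' method: it shuffles a prefix with random pairwise swaps and scores the strength as the total positional displacement of tokens, normalized to the range 0 to 1. The function names, the swap-based perturbation, and the normalization are all assumptions made for illustration.

```python
import random


def swap_permutation(n, num_swaps, seed=0):
    """Hypothetical perturbation: apply `num_swaps` random pairwise swaps
    to the positions 0..n-1 and return the resulting order.
    order[new_pos] is the original position of the token now at new_pos."""
    rng = random.Random(seed)
    order = list(range(n))
    if n < 2:
        return order
    for _ in range(num_swaps):
        i, j = rng.sample(range(n), 2)
        order[i], order[j] = order[j], order[i]
    return order


def apply_permutation(tokens, order):
    """Reorder prefix tokens according to `order`."""
    return [tokens[k] for k in order]


def perturbation_strength(order):
    """Assumed strength measure: total absolute positional displacement,
    divided by the maximum possible displacement (reached by full reversal),
    so the score lies in [0, 1]."""
    n = len(order)
    if n < 2:
        return 0.0
    total = sum(abs(new_pos - old_pos) for new_pos, old_pos in enumerate(order))
    return total / (n * n // 2)


if __name__ == "__main__":
    prefix = ["the", "quick", "brown", "fox", "jumps", "over"]
    order = swap_permutation(len(prefix), num_swaps=2, seed=0)
    print(apply_permutation(prefix, order), perturbation_strength(order))
```

Under this definition an unperturbed prefix scores 0.0 and a fully reversed prefix scores 1.0. In a counterfactual study of the kind the abstract outlines, the perturbed prefix would be fed to the model and the change in its continuation compared against the strength score.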
Taxonomy
Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education
