Beyond Frequency: The Role of Redundancy in Large Language Model Memorization

Jie Zhang; Qinghua Zhao; Chi-ho Lin; Zhongfeng Kang; Lei Li

arXiv:2506.12321·cs.LG·September 1, 2025

Open Access

TL;DR

This paper investigates how redundancy in training data influences memorization in large language models. It finds that low-redundancy samples account for most memorized samples and are more fragile under perturbation, a result that can inform privacy and fairness improvements.

Contribution

It uncovers the distinct response patterns of memorized versus non-memorized samples and links redundancy levels to memorization and vulnerability, offering new insights for data preprocessing.

Findings

01. About 79% of memorized samples are low-redundancy.

02. Low-redundancy samples are twice as vulnerable to perturbation as high-redundancy ones.

03. Memorized samples drop significantly under perturbation; non-memorized samples do not.

Abstract

Memorization in large language models poses critical risks for privacy and fairness as these systems scale to billions of parameters. While previous studies established correlations between memorization and factors like token frequency and repetition patterns, we reveal distinct response patterns: increases in frequency have minimal impact on memorized samples (e.g., 0.09) while substantially affecting non-memorized samples (e.g., 0.25), with consistency observed across model scales. Through counterfactual analysis that perturbs sample prefixes and quantifies perturbation strength through token positional changes, we demonstrate that redundancy correlates with memorization patterns. Our findings establish that: about 79% of memorized samples are low-redundancy, these low-redundancy samples exhibit 2-fold higher vulnerability than high-redundancy ones, and consequently memorized samples drop by…
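The abstract describes perturbing sample prefixes and measuring perturbation strength through token positional changes, but this listing does not give the exact procedure. The sketch below is one possible reading, offered only as an illustration: the function names `perturb_prefix` and `perturbation_strength`, the use of random pairwise swaps, and the choice of mean absolute displacement as the strength measure are all assumptions, not the authors' definitions.

```python
import random


def perturb_prefix(tokens, num_swaps, seed=0):
    """Shuffle a prefix by applying random pairwise swaps (hypothetical scheme).

    Returns the perturbed token list and `order`, where
    perturbed[k] == tokens[order[k]], so each token's move can be tracked
    even when the prefix contains repeated tokens.
    """
    rng = random.Random(seed)
    order = list(range(len(tokens)))
    if len(order) >= 2:
        for _ in range(num_swaps):
            i, j = rng.sample(range(len(order)), 2)
            order[i], order[j] = order[j], order[i]
    return [tokens[k] for k in order], order


def perturbation_strength(order):
    """Mean absolute positional displacement per token (assumed measure)."""
    if not order:
        return 0.0
    return sum(abs(new - old) for new, old in enumerate(order)) / len(order)


prefix = ["the", "quick", "brown", "fox", "jumps", "over"]
perturbed, order = perturb_prefix(prefix, num_swaps=2, seed=1)
print(perturbed, perturbation_strength(order))
```

Under this reading, a larger `num_swaps` tends to yield a larger strength value, and one would then compare how a model's continuation of the original sample changes as strength grows. How that change is scored is left open here because the listing does not specify it.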

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education