Trade-offs in Data Memorization via Strong Data Processing Inequalities

Vitaly Feldman; Guy Kornowski; Xin Lyu

arXiv:2506.01855 · cs.LG · October 29, 2025

Open Access

TL;DR

This paper studies the trade-off between the number of training samples available to a learning algorithm and the amount of information about the training data that the algorithm must memorize to be accurate. It establishes lower bounds on memorization using strong data processing inequalities, with implications for privacy and learning efficiency.

Contribution

The paper introduces a general approach that links strong data processing inequalities to lower bounds on excess data memorization, and demonstrates these bounds on several simple binary classification problems, extending prior work.

Findings

01. Ω(d) bits of information about the training data must be memorized when only O(1) d-dimensional examples are available.

02. The lower bound decays as the number of samples grows, at a rate that depends on the problem.

03. Simple algorithms nearly match the lower bounds, indicating that the bounds are close to tight.

Abstract

Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization's role in learning. In this work, we develop a general approach for proving lower bounds on excess data memorization, that relies on a new connection between strong data processing inequalities and data memorization. We then demonstrate that several simple and natural binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm, and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when $O(1)$ $d$-dimensional examples are…
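For readers unfamiliar with the term, a strong data processing inequality says that passing data through a noisy channel shrinks mutual information by a factor strictly below 1, not merely that it cannot increase it. The sketch below is a toy numerical illustration and is not taken from the paper: it assumes a uniform bit U sent through two binary symmetric channels in series, U -> X -> Y, and compares I(U;Y) / I(U;X) with the known contraction coefficient (1 - 2*delta)^2 of the second channel. All function names are hypothetical, and the paper's own setting and definitions may differ.

```python
import math


def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def flip(a, b):
    """Crossover probability of BSC(a) followed by BSC(b)."""
    return a * (1 - b) + (1 - a) * b


def mi_uniform_bsc(eps):
    """I(U; V) in bits for a uniform bit U and V = U sent through BSC(eps)."""
    return 1.0 - h2(eps)


def sdpi_ratio(alpha, delta):
    """I(U;Y) / I(U;X) for U -> X -> Y with channels BSC(alpha), then BSC(delta)."""
    return mi_uniform_bsc(flip(alpha, delta)) / mi_uniform_bsc(alpha)


if __name__ == "__main__":
    delta = 0.1
    eta = (1 - 2 * delta) ** 2  # contraction coefficient of BSC(delta)
    for alpha in (0.05, 0.2, 0.4, 0.49):
        ratio = sdpi_ratio(alpha, delta)
        print(f"alpha={alpha:.2f}  ratio={ratio:.4f}  eta={eta:.4f}")
```

In this toy chain the ratio stays below eta for every alpha and approaches it as X becomes nearly independent of U (alpha close to 1/2), which is the regime where the contraction is tightest.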

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Neural Networks and Applications