When is Memorization of Irrelevant Training Data Necessary for High-Accuracy Learning?

Gavin Brown; Mark Bun; Vitaly Feldman; Adam Smith; Kunal Talwar

arXiv:2012.06421 · cs.LG · July 23, 2021

TL;DR

This paper investigates whether memorizing irrelevant data is necessary for high-accuracy learning, showing that in certain natural problems, models must encode extensive information about training examples regardless of the algorithm or model class.

Contribution

The paper introduces natural prediction tasks in which every sufficiently accurate algorithm must memorize large amounts of irrelevant information, independent of the training method or model type.

Findings

1. Memorization of irrelevant data is necessary for high accuracy in certain tasks.

2. The results hold across different algorithms and model classes.

3. Experiments show successful attacks that exploit memorization in trained classifiers.

Abstract

Modern machine learning models are complex and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when the examples are high-dimensional and have entropy much higher than the sample size, and even when most of that information is ultimately irrelevant to the task at hand. Further, our results do not depend on the training algorithm or the class of models used…

Code & Models

Repositories

gavinrbrown1/training-data-memorization (PyTorch, Official)

Taxonomy

Methods: Logistic Regression