TL;DR
This paper investigates whether memorizing irrelevant data is necessary for high-accuracy learning, showing that in certain natural problems, models must encode extensive information about training examples regardless of the algorithm or model class.
Contribution
It introduces natural prediction tasks demonstrating that all sufficiently accurate algorithms must memorize large amounts of irrelevant information, independent of the training method or model type.
Findings
Memorization of irrelevant data is necessary for high accuracy in certain tasks.
Results hold across different algorithms and model classes.
Experiments show that attacks exploiting this memorization succeed against trained classifiers.
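To make the findings above concrete, here is a toy sketch of how memorization can become visible in a trained classifier. It is not the paper's construction or its attack: the data distribution, the random labels, the gradient-descent trainer, and the names `train_logreg` and `run_demo` are all assumptions chosen for illustration. The idea is that when the dimension is much larger than the sample size, a logistic regression fitted by gradient descent from zero has weights lying in the span of its training points, so those points tend to receive much larger scores than fresh points from the same distribution.

```python
import numpy as np


def train_logreg(X, y, steps=500, lr=1.0):
    """Gradient descent on the mean logistic loss; labels y are in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        # Per-example derivative of log(1 + exp(-margin)) with respect to the score.
        coeff = -y / (1.0 + np.exp(margins))
        w -= lr * (X.T @ coeff) / n
    return w


def run_demo(seed=0, n=20, d=2000):
    """Return the mean |score| on training points and on fresh points.

    Hypothetical setup: features are random signs scaled to unit norm and
    labels are random, so every feature is irrelevant to the label.
    """
    rng = np.random.default_rng(seed)
    X_train = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)
    y_train = rng.choice([-1.0, 1.0], size=n)
    X_fresh = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)

    w = train_logreg(X_train, y_train)

    train_mean = float(np.mean(np.abs(X_train @ w)))
    fresh_mean = float(np.mean(np.abs(X_fresh @ w)))
    return train_mean, fresh_mean


if __name__ == "__main__":
    train_mean, fresh_mean = run_demo()
    print(f"mean |score| on training points: {train_mean:.3f}")
    print(f"mean |score| on fresh points:    {fresh_mean:.3f}")
```

A gap between the two printed means indicates that the weight vector carries information about which specific points were used for training. That is only a loose analogue of the memorization the paper studies, which concerns what every sufficiently accurate learner must encode.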
Abstract
Modern machine learning models are complex and frequently encode surprising amounts of information about individual inputs. In extreme cases, complex models appear to memorize entire input examples, including seemingly irrelevant information (social security numbers from text, for example). In this paper, we aim to understand whether this sort of memorization is necessary for accurate learning. We describe natural prediction problems in which every sufficiently accurate training algorithm must encode, in the prediction model, essentially all the information about a large subset of its training examples. This remains true even when the examples are high-dimensional and have entropy much higher than the sample size, and even when most of that information is ultimately irrelevant to the task at hand. Further, our results do not depend on the training algorithm or the class of models used…
Taxonomy
Methods: Logistic Regression
