A Closer Look at Memorization in Deep Networks

Devansh Arpit; Stanisław Jastrzębski; Nicolas Ballas; David Krueger; Emmanuel Bengio; Maxinder S. Kanwal; Tegan Maharaj; Asja Fischer; Aaron Courville; Yoshua Bengio; Simon Lacoste-Julien

arXiv:1706.05394·stat.ML·July 4, 2017·357 cites


Open Access · 2 Repos

TL;DR

This paper investigates how deep neural networks memorize data. It shows that they tend to learn simple patterns before memorizing noise, examines how explicit regularization affects memorization of noise versus real data, and argues that dataset-independent notions of capacity are unlikely to explain generalization.

Contribution

The paper shows that the training data itself influences how much a network memorizes, and demonstrates that appropriately tuned explicit regularization (e.g., dropout) can degrade training performance on noise datasets without compromising generalization on real data.

Findings

01

Deep networks tend to learn simple patterns before memorizing noise.

02

Appropriately tuned explicit regularization, such as dropout, can hinder memorization of noise without compromising generalization on real data.

03

Dataset-independent notions of effective capacity are unlikely to explain generalization in deep networks trained with gradient-based methods.
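For readers unfamiliar with dropout, the regularizer named in the second finding, the following is a minimal sketch of the standard "inverted dropout" operation on a list of activations. It is an illustration only and is not taken from the paper's code; the function name `dropout` and its parameters are hypothetical.

```python
import random

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout (illustrative sketch, not the paper's implementation).

    During training, each activation is zeroed with probability `drop_prob`
    and the survivors are scaled by 1 / (1 - drop_prob), so the expected
    value of every unit is unchanged. At evaluation time the input is
    returned as-is.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

# Usage: apply dropout to a vector of ones with a fixed seed.
rng = random.Random(0)
dropped = dropout([1.0] * 8, drop_prob=0.5, rng=rng)
```

The drop probability is the knob that would need tuning in the sense of the finding: set too low it does little, set too high it also hurts fitting of real data.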

Abstract

We examine the role of memorization in deep learning, drawing connections to capacity, generalization, and adversarial robustness. While deep networks are capable of memorizing noise data, our results suggest that they tend to prioritize learning simple patterns first. In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data. Our analysis suggests that the notions of effective capacity which are dataset independent are unlikely to explain the generalization performance of deep networks when trained with gradient-based methods because training data itself plays an important role in determining the degree of…
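To make the "noise vs. real data" comparison concrete, here is a minimal sketch of two common ways to build a noise dataset: replacing a fraction of labels with random ones, and replacing inputs with Gaussian noise. The abstract does not specify the construction, so this is an assumption for illustration; the helper names `corrupt_labels` and `random_inputs` and all parameters are hypothetical.

```python
import random

def corrupt_labels(labels, num_classes, noise_fraction, seed=0):
    """Return (noisy_labels, corrupted_indices).

    A random subset of round(noise_fraction * len(labels)) positions has its
    label redrawn uniformly from range(num_classes). A redrawn label may
    coincide with the original one by chance. The input list is not modified.
    """
    if not 0.0 <= noise_fraction <= 1.0:
        raise ValueError("noise_fraction must be in [0, 1]")
    rng = random.Random(seed)
    noisy = list(labels)
    n_corrupt = round(noise_fraction * len(noisy))
    corrupted = rng.sample(range(len(noisy)), n_corrupt)
    for i in corrupted:
        noisy[i] = rng.randrange(num_classes)
    return noisy, sorted(corrupted)

def random_inputs(num_examples, dim, seed=0):
    """Pure-noise inputs: every feature is an i.i.d. standard Gaussian draw."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(num_examples)]

# Usage: corrupt 40% of the labels of a toy 10-class dataset.
clean = [i % 10 for i in range(1000)]
noisy, corrupted_idx = corrupt_labels(clean, num_classes=10, noise_fraction=0.4)
```

Training the same model on `clean` and on `noisy` labels, and tracking training accuracy separately on the corrupted and uncorrupted indices, is one way to observe whether simple patterns are fit before the noise.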

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning