An Empirical Study of Memorization in NLP

Xiaosen Zheng; Jing Jiang

arXiv:2203.12171 · cs.CL · March 24, 2022 · 1 citation

TL;DR

This study empirically tests the long-tail theory of memorization (Feldman, 2020) on three NLP tasks. The results support the theory: top-memorized training instances tend to be atypical, and the most memorized parts of an instance tend to be features negatively correlated with its class label.

Contribution

The paper checks empirically whether the long-tail theory of memorization holds in NLP, which the authors note had not been verified before, and develops an attribution method for explaining why a training instance is memorized.

Findings

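01 Top-ranked memorized training instances are likely to be atypical.

02 Removing the top-memorized training instances causes a larger drop in test accuracy than removing training instances at random.

03 The top-memorized parts of a training instance tend to be features that are negatively correlated with the class label.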
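
The comparison behind finding 02 can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a memorization score per training instance is already available, and the names `removal_sets` and `keep_indices` are hypothetical. It builds two removal sets of equal size, one holding the top-memorized instances and one drawn at random, so that a model retrained on each reduced training set can be compared on the same test data.

```python
import random


def removal_sets(scores, fraction, seed=0):
    """Return (top_memorized, random_baseline) index lists of equal size.

    scores   -- one memorization score per training instance (assumed given)
    fraction -- share of the training set to remove, in [0, 1]
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    n = len(scores)
    k = int(n * fraction)
    # Rank training indices by memorization score, highest first.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_memorized = ranked[:k]
    # Size-matched baseline: k distinct indices drawn uniformly at random.
    rng = random.Random(seed)
    random_baseline = rng.sample(range(n), k)
    return top_memorized, random_baseline


def keep_indices(n, removed):
    """Indices of the training instances that remain after removal."""
    removed = set(removed)
    return [i for i in range(n) if i not in removed]
```

Matching the size of the two removal sets keeps the comparison fair: any extra accuracy drop from the first set can then be attributed to which instances were removed rather than how many.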

Abstract

A recent study by Feldman (2020) proposed a long-tail theory to explain the memorization behavior of deep learning models. However, memorization has not been empirically verified in the context of NLP, a gap addressed by this work. In this paper, we use three different NLP tasks to check if the long-tail theory holds. Our experiments demonstrate that top-ranked memorized training instances are likely atypical, and removing the top-memorized training instances leads to a more serious drop in test accuracy compared with removing training instances randomly. Furthermore, we develop an attribution method to better understand why a training instance is memorized. We empirically show that our memorization attribution method is faithful, and share our interesting finding that the top-memorized parts of a training instance tend to be features negatively correlated with the class label.
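
For readers unfamiliar with the quantity being studied: in Feldman (2020), the memorization score of a training instance is the probability that a model trained on the full training set predicts its label correctly, minus the same probability when that instance is left out of training. The sketch below estimates this difference by training many models on random subsets. It is an illustration under stated assumptions, not necessarily how the paper computes its scores: the 1-nearest-neighbor learner, the function names, and `toy_data` are all hypothetical stand-ins chosen to keep the example small.

```python
import random


def one_nn_predict(train, x):
    """Toy learner (assumption): label of the nearest training point on a line."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def estimate_memorization(data, n_models=200, keep=0.7, seed=0):
    """Subsampling estimate of per-instance memorization scores.

    For each instance, score = accuracy on it across models that trained on it
    minus accuracy on it across models that did not.
    """
    rng = random.Random(seed)
    n = len(data)
    m = int(n * keep)  # size of each random training subset
    hits_in, cnt_in = [0] * n, [0] * n
    hits_out, cnt_out = [0] * n, [0] * n
    for _ in range(n_models):
        subset = set(rng.sample(range(n), m))
        train = [data[j] for j in sorted(subset)]
        for i, (x, y) in enumerate(data):
            correct = int(one_nn_predict(train, x) == y)
            if i in subset:
                hits_in[i] += correct
                cnt_in[i] += 1
            else:
                hits_out[i] += correct
                cnt_out[i] += 1
    scores = []
    for i in range(n):
        if cnt_in[i] == 0 or cnt_out[i] == 0:
            scores.append(float("nan"))  # never included or never excluded
        else:
            scores.append(hits_in[i] / cnt_in[i] - hits_out[i] / cnt_out[i])
    return scores


# Two well-separated clusters plus one atypical point: x = 0.15 sits inside
# the class-0 cluster but carries label 1.
toy_data = [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 0),
            (1.0, 1), (1.1, 1), (1.2, 1), (1.3, 1),
            (0.15, 1)]
toy_scores = estimate_memorization(toy_data)
```

On this toy data the atypical point can only be predicted correctly by a model that has seen it, so it receives the highest score, while points in the clean class-1 cluster are predicted correctly either way. That mirrors, in miniature, the link between memorization and atypicality that the abstract describes.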

Code & Models

Repositories

xszheng2020/memorization (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification