Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks
Vikas Raunak, Arul Menezes

TL;DR
This paper introduces an inexpensive algorithm to detect and analyze extractive memorization in neural machine translation, revealing its impact on model reliability and proposing finetuning as a mitigation strategy.
Contribution
The work presents a novel, cost-effective method for identifying extractive memorization in constrained sequence generation tasks, specifically NMT, and explores its effects and mitigation.
Findings
Extractive memorization significantly affects NMT reliability.
The proposed algorithm effectively identifies memorized samples.
Finetuning can reduce memorization in models.
Abstract
Memorization presents a challenge for several constrained Natural Language Generation (NLG) tasks such as Neural Machine Translation (NMT), wherein the proclivity of neural models to memorize noisy and atypical samples reacts adversely with the noisy (web crawled) datasets. However, previous studies of memorization in constrained NLG tasks have only focused on counterfactual memorization, linking it to the problem of hallucinations. In this work, we propose a new, inexpensive algorithm for extractive memorization (exact training data generation under insufficient context) in constrained sequence generation tasks and use it to study extractive memorization and its effects in NMT. We demonstrate that extractive memorization poses a serious threat to NMT reliability by qualitatively and quantitatively characterizing the memorized samples as well as the model behavior in their vicinity.…
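The abstract's definition of extractive memorization (exact training data generation under insufficient context) can be illustrated with a minimal sketch. This is not the authors' algorithm, only one plausible reading of the definition: a training pair is flagged if the model reproduces the full training target from a truncated source prefix. The names `find_extractively_memorized`, `toy_translate`, and `max_prefix_ratio`, the whitespace tokenization, and the 0.75 threshold are all assumptions made for illustration; `toy_translate` is a stand-in for a real NMT system.

```python
from typing import Callable, Iterable, List, Tuple


def find_extractively_memorized(
    pairs: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
    max_prefix_ratio: float = 0.75,  # assumed threshold, not from the paper
) -> List[Tuple[str, str, str]]:
    """Return (source, target, shortest_triggering_prefix) for flagged pairs."""
    flagged = []
    for source, target in pairs:
        tokens = source.split()  # naive whitespace tokenization (assumption)
        # Only test prefixes that omit a meaningful share of the source,
        # so the model is genuinely given insufficient context.
        longest = int(len(tokens) * max_prefix_ratio)
        for n in range(1, longest + 1):
            prefix = " ".join(tokens[:n])
            # Exact match against the training target is the memorization signal.
            if translate(prefix).strip() == target.strip():
                flagged.append((source, target, prefix))
                break
    return flagged


def toy_translate(text: str) -> str:
    """Hypothetical stand-in model that has memorized one training target."""
    if text.startswith("terms and conditions"):
        return "Es gelten die Bedingungen fuer alle Bestellungen"
    return text.upper()


pairs = [
    ("terms and conditions apply to all orders",
     "Es gelten die Bedingungen fuer alle Bestellungen"),
    ("hello world", "hallo welt"),
]
flagged = find_extractively_memorized(pairs, toy_translate)
for source, target, prefix in flagged:
    print(f"memorized: {source!r} triggered by prefix {prefix!r}")
```

With a real system, `translate` would wrap the model's decoding call. The cost is at most one decoding pass per tested prefix per training pair, which is what makes a prefix-based check of this kind comparatively cheap.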
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
