Causal Estimation of Memorisation Profiles

Pietro Lesci; Clara Meister; Thomas Hofmann; Andreas Vlachos; Tiago Pimentel

arXiv:2406.04327 · cs.LG · October 17, 2024

TL;DR

This paper introduces a new, efficient method to estimate memorisation in language models. The method makes it possible to analyse how memorisation develops during training and how it depends on factors such as model size, data order, and learning rate.

Contribution

It proposes a principled, computationally efficient approach to measure memorisation at the model instance level using a difference-in-differences design.
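The difference-in-differences design named above can be illustrated with the canonical two-group, two-period estimator. The sketch below is a generic textbook version, not the paper's estimator. It assumes the outcome is a per-instance log-likelihood measured at one checkpoint before and one checkpoint after a batch is trained on, with "treated" instances belonging to that batch and "control" instances not yet seen. The function name `did_estimate` and all numbers are hypothetical.

```python
from statistics import mean


def did_estimate(treated_before, treated_after, control_before, control_after):
    """Canonical 2x2 difference-in-differences estimate.

    Subtracts the control group's average change from the treated
    group's average change, so that improvement shared by both groups
    (e.g. general learning progress) cancels out.
    """
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# Hypothetical log-likelihoods, invented purely for illustration.
treated_before = [-3.0, -2.8, -3.2]   # batch instances, before training on them
treated_after = [-2.0, -1.8, -2.2]    # same instances, after training on them
control_before = [-3.1, -2.9, -3.0]   # unseen instances, earlier checkpoint
control_after = [-2.6, -2.4, -2.5]    # same unseen instances, later checkpoint

effect = did_estimate(treated_before, treated_after, control_before, control_after)
print(f"estimated effect of training on the batch: {effect:.3f}")
```

With these made-up numbers the treated group improves by about 1.0 and the control group by about 0.5, so the estimate is about 0.5. The estimate is only meaningful under the usual parallel-trends assumption, i.e. that both groups would have changed by the same amount had the treated instances not been trained on.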

Findings

01. Larger models exhibit stronger and more persistent memorisation.

02. Memorisation is influenced by data order and learning rate.

03. Memorisation trends are stable across different model sizes.

Abstract

Understanding memorisation in language models has practical and societal implications, e.g., studying models' training dynamics or preventing copyright infringements. Prior work defines memorisation as the causal effect of training with an instance on the model's ability to predict that instance. This definition relies on a counterfactual: the ability to observe what would have happened had the model not seen that instance. Existing methods struggle to provide computationally efficient and accurate estimates of this counterfactual. Further, they often estimate memorisation for a model architecture rather than for a specific model instance. This paper fills an important gap in the literature, proposing a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics. Using this method, we characterise a model's memorisation…

Code & Models

Repositories

pietrolesci/memorisation-profiles (PyTorch, official)

Taxonomy

Topics: Anomaly Detection Techniques and Applications

Methods: Sparse Evolutionary Training · Pythia