Elephants Never Forget: Testing Language Models for Memorization of Tabular Data
Sebastian Bordt, Harsha Nori, Rich Caruana

TL;DR
This paper investigates whether large language models have memorized popular tabular datasets, introduces tests to detect contamination and memorization, and shows that such exposure can invalidate performance evaluation on downstream tasks.
Contribution
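The paper introduces techniques for assessing contamination and memorization of tabular data in LLMs, ranging from qualitative checks of whether a model knows feature names and values to statistical tests for conditional distribution modeling and four memorization tests, and it shows that contamination can undermine the validity of evaluations.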
Findings
LLMs are pre-trained on many popular tabular datasets.
Models can reproduce data statistics without verbatim memorization.
Data contamination can lead to overestimated model performance.
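The distinction between reproducing statistics and reproducing rows can be made concrete with a toy comparison. The sketch below is illustrative only: `verbatim_rate` and `mean_gap` are hypothetical helper names, the data is made up, and nothing here is taken from the paper's own tests.

```python
from statistics import mean

def verbatim_rate(generated, original):
    """Fraction of generated rows that appear exactly in the original table."""
    seen = set(map(tuple, original))
    return sum(tuple(row) in seen for row in generated) / len(generated)

def mean_gap(generated, original, col):
    """Absolute difference between the means of one column in the two tables."""
    return abs(mean(r[col] for r in generated) - mean(r[col] for r in original))

# Made-up tables: the "generated" rows match the column means of the
# original rows, yet no generated row is an exact copy.
original = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
generated = [(1.5, 15.0), (2.0, 25.0), (2.5, 20.0)]

print(verbatim_rate(generated, original))  # share of exact copies
print(mean_gap(generated, original, 0))    # gap in the first column's mean
```

A low verbatim rate combined with a small gap in summary statistics corresponds to the regime described in the second finding: the statistics are reproduced, the rows are not.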
Abstract
While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Starting with simple qualitative tests for whether an LLM knows the names and values of features, we introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization. Our investigation reveals that LLMs are pre-trained on many popular tabular datasets. This exposure can lead to invalid performance evaluation on downstream tasks because the LLMs have, in effect, been fit to the test set. Interestingly, we also identify a regime where the language model reproduces important statistics of the data, but fails to reproduce the dataset…
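The abstract does not spell out the four memorization tests, so the following is only one plausible shape such a test could take: show a model a few consecutive rows of a CSV file and check whether it continues with the exact next row. The `complete` argument is a hypothetical stand-in for an LLM call (any function from prompt string to completion string), and `row_completion_rate` is an illustrative name, not the paper's API.

```python
def row_completion_rate(rows, complete, n_context=3):
    """Share of rows reproduced exactly when the model sees the preceding rows.

    rows: list of CSV lines in file order.
    complete: hypothetical callable mapping a prompt string to a completion.
    """
    hits = trials = 0
    for i in range(n_context, len(rows)):
        prompt = "\n".join(rows[i - n_context:i]) + "\n"
        # Only the first completed line is compared against the true next row.
        guess = complete(prompt).strip().split("\n")[0]
        hits += guess == rows[i]
        trials += 1
    return hits / trials if trials else 0.0

# Toy rows (made up) and two stand-in "models" for demonstration.
rows = ["5.1,3.5,1.4,0.2", "4.9,3.0,1.4,0.2", "4.7,3.2,1.3,0.2",
        "4.6,3.1,1.5,0.2", "5.0,3.6,1.4,0.3"]

def memorizing_model(prompt):
    """Pretends to have memorized the file: continues it verbatim."""
    text = "\n".join(rows) + "\n"
    start = text.find(prompt)
    return text[start + len(prompt):] if start >= 0 else ""

def blank_model(prompt):
    """Knows nothing about the file."""
    return ""

print(row_completion_rate(rows, memorizing_model))
print(row_completion_rate(rows, blank_model))
```

Exact string matching is a deliberately strict criterion: for rows with many numeric digits, a high match rate is hard to explain by chance, which is what makes it usable as evidence of memorization rather than of good modeling.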
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies
Methods: Sparse Evolutionary Training
