Elephants Never Forget: Testing Language Models for Memorization of Tabular Data

Sebastian Bordt; Harsha Nori; Rich Caruana

arXiv:2403.06644 · cs.LG · March 12, 2024 · 1 citation

TL;DR

This paper investigates how large language models memorize tabular data, introduces methods to detect memorization and contamination, and highlights implications for data integrity and evaluation in machine learning.

Contribution

It presents novel techniques for assessing memorization in LLMs for tabular data and reveals the extent of data contamination affecting evaluation validity.

Findings

01. LLMs are often pre-trained on popular tabular datasets.

02. Models can reproduce data statistics without verbatim memorization.

03. Data contamination can lead to overestimated model performance.

Abstract

While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Starting with simple qualitative tests for whether an LLM knows the names and values of features, we introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization. Our investigation reveals that LLMs are pre-trained on many popular tabular datasets. This exposure can lead to invalid performance evaluation on downstream tasks because the LLMs have, in effect, been fit to the test set. Interestingly, we also identify a regime where the language model reproduces important statistics of the data, but fails to reproduce the dataset…
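
The abstract mentions tests that identify memorization but does not spell them out here. As a rough illustration only, one way such a check could look is a row-completion probe: show the model a few consecutive rows of a CSV file and see whether it continues with the exact next row. The sketch below is a toy under stated assumptions, not the authors' implementation. The function `row_completion_rate`, the `complete` callable (standing in for any LLM text-completion call), the stub models, and the rows are all hypothetical.

```python
from typing import Callable, Sequence


def row_completion_rate(
    rows: Sequence[str],
    complete: Callable[[str], str],
    n_context: int = 3,
) -> float:
    """Fraction of rows reproduced exactly when prompted with the rows before them.

    `complete` is a hypothetical stand-in for an LLM completion call:
    it takes a prompt string and returns the generated continuation.
    """
    if len(rows) <= n_context:
        raise ValueError("need more rows than n_context")
    hits = 0
    trials = 0
    for i in range(n_context, len(rows)):
        # Prompt with n_context consecutive rows, in file order.
        prompt = "\n".join(rows[i - n_context:i]) + "\n"
        lines = complete(prompt).splitlines()
        first_line = lines[0].strip() if lines else ""
        # Count a hit only on an exact match with the true next row.
        hits += int(first_line == rows[i].strip())
        trials += 1
    return hits / trials


def make_memorizing_stub(rows: Sequence[str]) -> Callable[[str], str]:
    """Toy 'model' that has stored the file verbatim (assumption for the demo)."""
    text = "\n".join(rows) + "\n"

    def complete(prompt: str) -> str:
        idx = text.find(prompt)
        return text[idx + len(prompt):] if idx >= 0 else ""

    return complete


# Made-up rows for illustration; not taken from any real dataset.
toy_rows = [f"id{i},{i * 1.5},label{i % 2}" for i in range(6)]

memorized_rate = row_completion_rate(toy_rows, make_memorizing_stub(toy_rows))
unseen_rate = row_completion_rate(toy_rows, lambda prompt: "unknown")
print(memorized_rate, unseen_rate)
```

A rate near 1.0 on rows that are hard to guess would point toward verbatim memorization. A low rate does not by itself rule out contamination, since the abstract notes a regime where the model reproduces important statistics of the data without reproducing the dataset.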

Code & Models

Repositories

interpretml/llm-tabular-memorization-checker (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Semantic Web and Ontologies

Methods: Sparse Evolutionary Training