Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models

Sebastian Bordt; Harsha Nori; Vanessa Rodrigues; Besmira Nushi; Rich Caruana

arXiv:2404.06209·cs.LG·December 5, 2024·2 cites

Open Access · 1 Repo

TL;DR

This paper investigates how large language models memorize and learn from tabular data, showing that they overfit to datasets seen during training and have limited sample efficiency on statistical tasks.

Contribution

It introduces techniques for detecting whether an LLM has memorized a tabular dataset, and compares few-shot performance on datasets seen during training with performance on datasets released after training, highlighting overfitting and robustness issues.
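One way to picture a verbatim-memorization check is a row-completion probe: show the model a few consecutive rows of a CSV file and see whether it reproduces the next row exactly. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation or the API of the official repository. The function `row_completion_rate`, the `complete` callable (standing in for an LLM call), the helper `make_memorizing_model`, and the toy rows are all hypothetical.

```python
from typing import Callable, Sequence


def row_completion_rate(
    rows: Sequence[str],
    complete: Callable[[str], str],
    n_context: int = 3,
) -> float:
    """Fraction of probes where `complete` reproduces the next row verbatim.

    `complete` is a hypothetical stand-in for a language model: it takes a
    text prompt and returns the continuation as a string.
    """
    trials = len(rows) - n_context
    if trials <= 0:
        raise ValueError("need more rows than n_context")
    hits = 0
    for i in range(trials):
        # Prompt with n_context consecutive rows, ending in a newline.
        prompt = "\n".join(rows[i:i + n_context]) + "\n"
        target = rows[i + n_context].strip()
        # Only the first generated line is compared against the true row.
        lines = complete(prompt).strip().splitlines()
        first_line = lines[0].strip() if lines else ""
        hits += first_line == target
    return hits / trials


def make_memorizing_model(rows: Sequence[str]) -> Callable[[str], str]:
    """Toy 'model' that has memorized the table and continues it verbatim."""
    table = "\n".join(rows) + "\n"

    def complete(prompt: str) -> str:
        pos = table.find(prompt)
        return "" if pos < 0 else table[pos + len(prompt):]

    return complete


# Made-up rows for illustration only; not taken from any real dataset.
ROWS = ["1,ana,34", "2,ben,27", "3,cy,41", "4,dee,19", "5,eli,52"]

memorized_rate = row_completion_rate(ROWS, make_memorizing_model(ROWS))
unseen_rate = row_completion_rate(ROWS, lambda prompt: "")
print(memorized_rate, unseen_rate)
```

A high completion rate on rows with many distinct numeric values would be hard to reach by chance, which is why exact-match probes of this kind are a natural signal of verbatim memorization. A real check would replace `complete` with an actual model call.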

Findings

01 LLMs have memorized many popular tabular datasets verbatim.

02 Few-shot performance is better on datasets seen during training than on datasets released after training, indicating that memorization leads to overfitting.

03 LLMs are robust to data transformations but have limited sample efficiency on statistical tasks.

Abstract

While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of different techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the…
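The comparison between datasets seen during training and datasets released afterwards can be summarized as a gap in mean few-shot accuracy. The sketch below is an illustration only, assuming per-dataset accuracies have already been measured. The function `seen_unseen_gap`, the dataset names, and the accuracy values are hypothetical placeholders and are not results from the paper.

```python
from statistics import mean
from typing import Mapping


def seen_unseen_gap(
    accuracy_by_dataset: Mapping[str, float],
    seen_during_training: Mapping[str, bool],
) -> float:
    """Mean accuracy on seen datasets minus mean accuracy on unseen ones."""
    seen = [
        acc for name, acc in accuracy_by_dataset.items()
        if seen_during_training[name]
    ]
    unseen = [
        acc for name, acc in accuracy_by_dataset.items()
        if not seen_during_training[name]
    ]
    if not seen or not unseen:
        raise ValueError("need at least one seen and one unseen dataset")
    return mean(seen) - mean(unseen)


# Placeholder numbers for illustration only; not results from the paper.
accuracy_by_dataset = {"seen_a": 0.75, "seen_b": 0.5, "new_a": 0.5, "new_b": 0.25}
seen_during_training = {"seen_a": True, "seen_b": True, "new_a": False, "new_b": False}

gap = seen_unseen_gap(accuracy_by_dataset, seen_during_training)
print(f"seen minus unseen accuracy: {gap:.3f}")
```

A positive gap is consistent with the overfitting described in the abstract, although a gap on its own does not separate memorization from differences in task difficulty between the two groups of datasets.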

Code & Models

Repositories

interpretml/llm-tabular-memorization-checker (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training