Evaluating Latent Knowledge of Public Tabular Datasets in Large Language Models
Matteo Silvestri, Fabiano Veglianti, Flavio Giorgi, Fabrizio Silvestri, Gabriele Tolomei

TL;DR
This paper introduces a framework for detecting when large language models' performance on tabular data is driven by dataset contamination. It finds contamination in several widely used datasets, which raises concerns about the reliability of evaluations built on them.
Contribution
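The paper proposes a method for assessing contamination in tabular datasets: controlled multiple-choice queries are evaluated on both the original data and systematically transformed versions of it, and statistical testing is used to detect significant deviations in performance.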
Findings
Contamination detected in 4 out of 8 datasets analyzed.
Performance inflation due to contamination may affect evaluation reliability.
Framework enables systematic detection of dataset contamination in LLMs.
Abstract
Large language models (LLMs) are increasingly exposed to data contamination, i.e., performance gains driven by prior exposure to test datasets rather than generalization. However, in the context of tabular data, this problem is largely unexplored. Existing approaches primarily rely on memorization tests, which are too coarse to detect contamination. In contrast, we propose a framework for assessing contamination in tabular datasets by generating controlled queries and performing comparative evaluation. Given a dataset, we craft multiple-choice aligned queries that preserve task structure while allowing systematic transformations of the underlying data. These transformations are designed to selectively disrupt dataset information while preserving partial knowledge, enabling us to isolate performance attributable to contamination. We complement this setup with non-neural baselines that…
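The comparative evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the transformation (`shuffle_column`, which permutes one column's values across rows) and the test (a one-sided exact McNemar test on paired correctness flags) are assumptions chosen only to show the general shape of "transform the data, re-ask the same queries, test for a significant drop". All function names are hypothetical.

```python
import random
from math import comb


def shuffle_column(rows, column, seed=0):
    """Hypothetical transformation: permute one column's values across rows.

    The column's overall distribution is preserved, but the link between
    that value and the rest of each row is disrupted.
    """
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


def mcnemar_exact(correct_original, correct_transformed):
    """One-sided exact McNemar test on paired correctness flags.

    Returns the p-value for the hypothesis that the model answers correctly
    more often on the original queries than on the transformed ones.
    """
    pairs = list(zip(correct_original, correct_transformed))
    # Discordant pairs: correct only on original (b), only on transformed (c).
    b = sum(1 for o, t in pairs if o and not t)
    c = sum(1 for o, t in pairs if t and not o)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # P(X >= b) for X ~ Binomial(n, 0.5)
    return sum(comb(n, k) for k in range(b, n + 1)) / 2**n
```

In use, one would pose the same multiple-choice queries against the original rows and against `shuffle_column(rows, ...)`, record a correctness flag per query in each condition, and pass the two lists to `mcnemar_exact`. A small p-value indicates that accuracy dropped under the transformation; whether such a drop reflects contamination depends on the transformation design, which the paper addresses and this sketch does not.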
