Cross-table Synthetic Tabular Data Detection
G. Charbel N. Kindji (LACODAM), Lina Maria Rojas-Barahona, Elisa Fromont (LACODAM), Tanguy Urvoy

TL;DR
This paper investigates the challenge of detecting synthetic tabular data across diverse datasets and generators, proposing baseline methods and evaluation protocols to assess the difficulty of cross-table detection in real-world scenarios.
Contribution
It introduces three baseline detectors and four evaluation protocols to study the problem of cross-table synthetic data detection in varied and realistic settings.
Findings
Cross-table detection remains a challenging task.
Baseline detectors show limited effectiveness in 'wild' scenarios.
Evaluation protocols highlight the variability in detection difficulty.
Abstract
Detecting synthetic tabular data is essential to prevent the distribution of false or manipulated datasets that could compromise data-driven decision-making. This study explores whether synthetic tabular data can be reliably identified "in the wild", meaning across different generators, domains, and table formats. This challenge is unique to tabular data, where structures (such as number of columns, data types, and formats) can vary widely from one table to another. We propose three cross-table baseline detectors and four distinct evaluation protocols, each corresponding to a different level of "wildness". Our very preliminary results confirm that cross-table adaptation is a challenging task.
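To make the cross-table idea concrete, here is a minimal sketch of one possible "wild" evaluation split, in which every row of a given table is held out of training so that a detector is tested only on a table it has never seen. This is an illustration under stated assumptions, not one of the paper's four protocols: the function name `leave_group_out_splits`, the record fields (`table`, `generator`, `label`, `text`), and the toy values are all hypothetical.

```python
def leave_group_out_splits(rows, group_key):
    """Yield (held_out, train, test), holding out each group exactly once.

    Assumption: each row is a dict, and rows[group_key] identifies the
    source table (or any other grouping we want unseen at test time).
    """
    groups = sorted({row[group_key] for row in rows})
    for held_out in groups:
        train = [r for r in rows if r[group_key] != held_out]
        test = [r for r in rows if r[group_key] == held_out]
        yield held_out, train, test


# Hypothetical toy records: one serialized table row each.
# label 0 = real, label 1 = synthetic; generator is None for real rows.
rows = [
    {"table": "table_a", "generator": None, "label": 0, "text": "age:39|job:clerk"},
    {"table": "table_a", "generator": "gen_x", "label": 1, "text": "age:41|job:sales"},
    {"table": "table_b", "generator": None, "label": 0, "text": "price:210.5|rooms:3"},
    {"table": "table_b", "generator": "gen_y", "label": 1, "text": "price:198.0|rooms:4"},
    {"table": "table_c", "generator": None, "label": 0, "text": "temp:21.4|city:Lyon"},
    {"table": "table_c", "generator": "gen_x", "label": 1, "text": "temp:19.9|city:Nice"},
]

splits = list(leave_group_out_splits(rows, "table"))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Grouping by `table` means the train and test sets never share a schema, which is the source of difficulty the abstract describes. A milder variant could split rows at random within each table instead, so that every schema is seen during training.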
Taxonomy
Topics: Algorithms and Data Compression · Currency Recognition and Detection
