A Sobering Look at Tabular Data Generation via Probabilistic Circuits
Davide Scassola, Dylan Ponsford, Adrián Javaloy, Sebastiano Saccani, Luca Bortolussi, Henry Gouk, Antonio Vergari

TL;DR
This paper critically examines the current state-of-the-art in tabular data generation, revealing limitations in evaluation metrics and demonstrating that probabilistic circuits can outperform diffusion models at a lower cost.
Contribution
It challenges the perceived progress in tabular data generation by highlighting evaluation issues and proposing probabilistic circuits as a competitive alternative.
Findings
Probabilistic circuits deliver competitive or superior performance to diffusion models in tabular data generation, at a fraction of the cost.
Current evaluation protocols have limitations in assessing the fidelity of generated data.
There is significant room for improvement in generating realistic tabular data.
Abstract
Tabular data is more challenging to generate than text and images, due to its heterogeneous features and much lower sample sizes. On this task, diffusion-based models are the current state-of-the-art (SotA) model class, achieving almost perfect performance on commonly used benchmarks. In this paper, we question the perception of progress for tabular data generation. First, we highlight the limitations of current protocols to evaluate the fidelity of generated data, and advocate for alternative ones. Next, we revisit a simple baseline -- hierarchical mixture models in the form of deep probabilistic circuits (PCs) -- which delivers competitive or superior performance to SotA models for a fraction of the cost. PCs are the generative counterpart of decision forests, and as such can natively handle heterogeneous data as well as deliver tractable probabilistic generation and inference.…
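To make the idea of a mixture model over heterogeneous features concrete, here is a minimal sketch in Python with NumPy. It is not the paper's model: it is a shallow, single-layer circuit (one sum node over two product nodes), whereas the paper uses deep PCs. All parameter values and function names below are hypothetical, chosen only for illustration. The sketch shows a numeric and a categorical feature handled side by side, exact sampling, and exact marginalization of a feature by skipping its leaf.

```python
import numpy as np

# Hypothetical toy parameters (not from the paper): a sum node over K=2
# product nodes, each with one Gaussian leaf and one categorical leaf.
WEIGHTS = np.array([0.6, 0.4])            # sum-node weights, sum to 1
MEANS = np.array([0.0, 5.0])              # Gaussian leaf means
STDS = np.array([1.0, 2.0])               # Gaussian leaf standard deviations
CAT_PROBS = np.array([[0.7, 0.2, 0.1],    # categorical leaf, component 0
                      [0.1, 0.3, 0.6]])   # categorical leaf, component 1


def log_gaussian(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2


def log_likelihood(x_num, x_cat=None):
    """Log-density per row; x_cat=None marginalizes the categorical feature."""
    x_num = np.asarray(x_num, dtype=float)
    # Product nodes: sum the leaf log-densities per component -> shape (n, K).
    log_p = log_gaussian(x_num[:, None], MEANS[None, :], STDS[None, :])
    if x_cat is not None:
        log_p = log_p + np.log(CAT_PROBS[:, np.asarray(x_cat)]).T
    # Sum node: weighted mixture, computed with a stable log-sum-exp.
    log_p = log_p + np.log(WEIGHTS)[None, :]
    m = log_p.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True)))[:, 0]


def sample(n, rng):
    """Ancestral sampling: pick a component, then sample each leaf."""
    comp = rng.choice(len(WEIGHTS), size=n, p=WEIGHTS)
    x_num = rng.normal(MEANS[comp], STDS[comp])
    # Inverse-CDF sampling of the categorical leaf, clipped for safety.
    cdf = np.cumsum(CAT_PROBS[comp], axis=1)
    x_cat = (rng.random(n)[:, None] > cdf).sum(axis=1)
    x_cat = np.minimum(x_cat, CAT_PROBS.shape[1] - 1)
    return x_num, x_cat
```

A call such as `sample(5, np.random.default_rng(0))` returns five mixed-type rows, and `log_likelihood(x_num)` without `x_cat` gives the exact marginal density of the numeric feature with no approximation, which is the kind of tractable inference the abstract refers to.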
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning and Algorithms · Machine Learning in Healthcare
