LaTable: Towards Large Tabular Models

Boris van Breugel, Jonathan Crabbé, Rob Davis, Mihaela van der Schaar

arXiv:2406.17673·cs.LG·June 26, 2024

Open Access · 3 Reviews

TL;DR

LaTable introduces a novel diffusion-based model for generating tabular data, addressing heterogeneity and metadata challenges, and demonstrates superior in-distribution performance and potential for out-of-distribution data generation.

Contribution

LaTable is a tabular diffusion model designed to be trained across different datasets, handling heterogeneous feature spaces and tabular metadata, and it improves in-distribution generation quality over the baselines.

Findings

1. LaTable outperforms baselines on in-distribution data generation.
2. Finetuning LaTable generates out-of-distribution datasets better, and with fewer samples.
3. Zero-shot performance remains poor, indicating room for improvement.

Abstract

Tabular data is one of the most ubiquitous modalities, yet the literature on tabular generative foundation models is lagging far behind its text and vision counterparts. Creating such a model is hard, due to the heterogeneous feature spaces of different tabular datasets, tabular metadata (e.g. dataset description and feature headers), and tables lacking prior knowledge (e.g. feature order). In this work we propose LaTable: a novel tabular diffusion model that addresses these challenges and can be trained across different datasets. Through extensive experiments we find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples. On the other hand, we explore the poor zero-shot performance of LaTable, and what it may teach us about building generative tabular foundation models with better…
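The point about tables lacking a natural feature order can be made concrete with a toy example. The sketch below is not LaTable's architecture; it is a hypothetical layer (the names `equivariant_layer` and `permute` are invented here) that only illustrates the column-order equivariance property: reordering the input columns reorders the outputs in the same way and changes nothing else.

```python
from typing import List, Sequence


def equivariant_layer(columns: Sequence[int]) -> List[int]:
    """Toy permutation-equivariant map over per-column values.

    Each output mixes a column's own value with a pooled summary (the sum)
    of all columns. The sum ignores ordering, so reordering the input
    columns reorders the outputs in exactly the same way.
    """
    pooled = sum(columns)  # order-independent summary of the whole row
    return [2 * value + pooled for value in columns]


def permute(values: Sequence[int], order: Sequence[int]) -> List[int]:
    """Reorder values so that position i holds values[order[i]]."""
    return [values[i] for i in order]


row = [3, 10, 7]   # hypothetical row with three integer-coded features
order = [2, 0, 1]  # an arbitrary reordering of the columns

# Applying the layer and then permuting matches permuting and then applying.
out_then_permute = permute(equivariant_layer(row), order)
permute_then_out = equivariant_layer(permute(row, order))
print(out_then_permute == permute_then_out)  # → True
```

A model built only from such order-agnostic operations cannot depend on where a column happens to sit in the table, which is the kind of property a cross-dataset tabular model would plausibly want.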

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 3

Strengths

- The authors provide a clear motivation for the need for tabular generative models and present a model designed to meet the specific requirements of tabular data generation. The related works section is well-written, highlighting why LLM-based approaches are not optimal compared to their diffusion-based approach.
- The authors conducted comprehensive experiments, examining critical factors beyond general model performance, such as scalability, cross-dataset training procedures, and the impact o

Weaknesses

- The evaluation setup is unclear. The authors mention Cardio, URL, WiDS, Insurance, and Heloc as test datasets, citing Stoian et al. However, only URL, WiDS, and Heloc are covered in that paper; details on Cardio and Insurance datasets are not disclosed, and relevant citations for these datasets are missing. Additionally, the authors do not provide clear references to the baselines (e.g., it is unclear which papers CTGAN, TVAE, ARF, DDPM, and GREAT correspond to in L343).
- The authors state th

Reviewer 02 · Rating 5 · Confidence 3

Strengths

The paper clearly outlines four primary design goals—cross-dataset generation, handling of categorical and numerical features, use of textual context, and column order equivariance—and effectively aligns these with specific model design choices. Additionally, it identifies scaling laws unique to tabular data, which is valuable given that this area has not been thoroughly explored within the scope of tabular foundation models.

Weaknesses

1. LaTable shows limited robustness on non-binary classification tasks, such as multi-class classification and regression, suggesting constrained generalization across different task types.
2. The descriptions of datasets and baseline models are brief and lack detail.
3. The evaluation metrics are limited, primarily focusing on downstream performance.
4. Figure 2 is oversized.
5. Although the paper acknowledges issues of data bias and fairness, it does not explore practical approaches to detecti

Reviewer 03 · Rating 3 · Confidence 2

Strengths

- The paper addresses the underexplored domain of large-scale tabular data modeling, a departure from traditional focus areas in foundation models such as text and vision.
- The model meets several carefully formulated desiderata: cross-dataset generation, mixed-type handling, use of textual metadata, and equivariance to column order. The authors thoroughly address each desideratum in their model design.
- LaTable represents an important step toward creating generative models that can be applied t

Weaknesses

I mostly found it hard to understand the architecture of the model and the training objectives you used on it. I am from outside the tabular data community, so this may be the reason, but I think it should be clear to people outside the community as well. It is clear that you very carefully designed the architecture to meet all your requirements, but in the end I wasn't sure what the input and output are, or how you train everything end-to-end. A higher-level description is required instead of d

Taxonomy

Topics: 3D Modeling in Geospatial Applications · Advanced Database Systems and Queries

Methods: Diffusion