CardiCat: a Variational Autoencoder for High-Cardinality Tabular Data

Lee Carlin; Yuval Benjamini

arXiv:2501.17324·cs.LG·January 30, 2025

TL;DR

CardiCat is a novel variational autoencoder designed to effectively model and generate high-cardinality, imbalanced tabular data using joint embedding layers, outperforming existing models in quality and scalability.

Contribution

It introduces a regularized dual encoder-decoder embedding approach that reduces parameters and improves modeling of complex categorical features in tabular data.

Findings

01

Generates high-quality synthetic data for complex categorical features.

02

Uses fewer parameters than competing models, enabling large-scale learning.

03

Outperforms existing VAEs in representing imbalanced high-cardinality data.

Abstract

High-cardinality categorical features are a common characteristic of mixed-type tabular datasets. Existing generative model architectures struggle to learn the complexities of such data at scale, primarily due to the difficulty of parameterizing the categorical features. In this paper, we present a general variational autoencoder model, CardiCat, that can accurately fit imbalanced high-cardinality and heterogeneous tabular data. Our method substitutes one-hot encoding with regularized dual encoder-decoder embedding layers, which are jointly learned. This approach enables us to use embeddings that depend also on the other covariates, leading to a compact and homogenized parameterization of categorical features. Our model employs a considerably smaller trainable parameter space than competing methods, enabling learning at a large scale. CardiCat generates high-quality synthetic data that…
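
The parameter-count argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. This is not the CardiCat architecture: it only compares the first dense layer of a generic encoder fed by one-hot vectors against one fed by concatenated embeddings. The function names, cardinalities, embedding widths, and hidden size below are all hypothetical values chosen for illustration.

```python
# Hypothetical illustration: weight counts for the first encoder layer only.
# Biases are ignored; none of these numbers come from the paper.

def one_hot_params(cardinalities, hidden):
    """Weights of a dense layer whose input is the concatenated one-hot vectors."""
    return sum(cardinalities) * hidden


def embedding_params(cardinalities, emb_dims, hidden):
    """Weights of per-feature embedding tables plus a dense layer on their concatenation."""
    tables = sum(k * d for k, d in zip(cardinalities, emb_dims))
    dense = sum(emb_dims) * hidden
    return tables + dense


# Assumed example: three categorical features, one of them high-cardinality.
cards = [10_000, 500, 50]   # number of categories per feature (assumed)
dims = [16, 8, 4]           # embedding width per feature (assumed)
hidden = 256                # width of the first hidden layer (assumed)

n_one_hot = one_hot_params(cards, hidden)
n_embedded = embedding_params(cards, dims, hidden)

print(f"one-hot input layer:   {n_one_hot:,} weights")
print(f"embedding input layer: {n_embedded:,} weights")
```

Under these assumed sizes the one-hot layer needs 2,700,800 weights and the embedding variant 171,368, because the high-cardinality feature multiplies the hidden width only in the one-hot case. A one-hot decoder output layer would show a similar gap.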

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. The paper is generally well-written.
2. The authors follow consistent notations throughout the paper.
3. The code is provided.

Weaknesses

**1. [Important] Seemingly inaccurate claim of contribution.** CardiCat does not seem to be the first to employ dual embeddings in tabular data generation. I would suggest the authors refer to some recent papers, like TabSyn [1], where the VAE is equipped with a trainable tokeniser as in CardiCat.

**2. [Important] Incomplete comparison to benchmark methods.** The paper seems to include only some conventional VAE and GAN methods for comparison. However, there has been some recent work on ge…

Reviewer 02 · Rating 5 · Confidence 3

Strengths

The authors introduce a novel method for fitting imbalanced tabular data. The paper is easy to follow and understand. The results in Table 2 show better performance than other VAE-based methods.

Weaknesses

Lack of state-of-the-art comparative methods. Most of the compared methods (VAE, TVAE) are from before 2019, while comparisons against the most advanced methods are necessary. In Figure 3, the proposed model seems to have similar or worse performance than tGAN, especially for the marginal reconstruction. In Table 2, do you have any comparisons with tGAN?

Reviewer 03 · Rating 3 · Confidence 4

Strengths

- The method addresses a well-known and relevant problem.
- The structure of the paper is well-organized.

Weaknesses

- The contribution appears technically minimal or lacks sufficient justification.
- Certain theoretical aspects require further review and clarification.
- The related work section provides only a high-level overview and omits several relevant references.
- Additional baselines are needed to strengthen the empirical evidence supporting the contributions and demonstrate their significance.

Taxonomy

Topics: Image Processing and 3D Reconstruction