XGenBoost: Synthesizing Small and Large Tabular Datasets with XGBoost
Jim Achterberg, Marcel Haas, Bram van Dijk, Marco Spruit

TL;DR
XGenBoost introduces two XGBoost-based generative models tailored for small and large tabular datasets, leveraging tree-based structures to improve data synthesis quality and efficiency over existing methods.
Contribution
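The paper presents two XGBoost-based generative models for mixed-type tabular data: a diffusion model (a DDIM with XGBoost as score estimator) aimed at smaller datasets, and a hierarchical autoregressive model aimed at large-scale synthesis. Both are reported to outperform prior models.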
Findings
Outperforms previous neural- and tree-based models in tabular data synthesis
Effective for both small and large datasets with lower training costs
Leverages native categorical splits and hierarchical classifiers for mixed data types
Abstract
Tree ensembles such as XGBoost are often preferred for discriminative tasks in mixed-type tabular data, due to their inductive biases, minimal hyperparameter tuning, and training efficiency. We argue that these qualities, when leveraged correctly, can make for better generative models as well. As such, we present XGenBoost, a pair of generative models based on XGBoost: i) a Denoising Diffusion Implicit Model (DDIM) with XGBoost as score-estimator suited for smaller datasets, and ii) a hierarchical autoregressive model whose conditionals are learned via XGBoost classifiers, suited for large-scale tabular synthesis. The architectures follow from the natural constraints imposed by tree-based learners, e.g., in the diffusion model, combining Gaussian and multinomial diffusion to leverage native categorical splits and avoid one-hot encoding while accurately modelling mixed data types. In the…
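The autoregressive idea in the abstract can be illustrated with a toy sketch. It factorizes a row as p(x_1, ..., x_d) = p(x_1) p(x_2 | x_1) ... p(x_d | x_1, ..., x_{d-1}) and samples one column at a time. This is not the authors' implementation: the function names `fit_conditionals` and `sample_row` are hypothetical, and each conditional is a plain count table standing in for the XGBoost classifier the paper uses. The hierarchical structure mentioned in the abstract is not modelled here.

```python
import random
from collections import Counter, defaultdict


def fit_conditionals(rows):
    """For each column j, count values of x_j given the exact prefix (x_0..x_{j-1}).

    Assumption: a count table replaces the learned classifier, so this toy
    version cannot generalize to prefixes it has never seen.
    """
    n_cols = len(rows[0])
    tables = [defaultdict(Counter) for _ in range(n_cols)]
    for row in rows:
        for j in range(n_cols):
            tables[j][tuple(row[:j])][row[j]] += 1
    return tables


def sample_row(tables, rng):
    """Draw one synthetic row column by column, following the chain rule."""
    row = []
    for table in tables:
        counts = table[tuple(row)]  # conditional counts for the current prefix
        values = list(counts)
        weights = [counts[v] for v in values]
        row.append(rng.choices(values, weights=weights, k=1)[0])
    return tuple(row)


# Tiny categorical dataset (made up for illustration only).
rows = [("a", 0, "x"), ("a", 1, "y"), ("b", 0, "x"), ("a", 0, "x")]
tables = fit_conditionals(rows)
rng = random.Random(0)
synthetic = [sample_row(tables, rng) for _ in range(5)]
```

Because the count table conditions on the exact prefix, every sampled row here is a copy of some training row. A learned classifier for each conditional shares information across prefixes, which is what allows an autoregressive model to produce rows that do not appear in the training data.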
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Evolutionary Algorithms and Applications · Tensor decomposition and applications
