Unmasking Trees for Tabular Data

Calvin McCarter

arXiv:2407.05593·cs.LG·July 24, 2025

TL;DR

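UnmaskingTrees is a gradient-boosted decision tree method for tabular data imputation and generation. On a benchmark of 27 small tabular datasets it leads on imputation, is state-of-the-art on generation from data with missingness, and is competitive on generation from complete data. The paper also proposes BaltoBot, a new probabilistic prediction method.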

Contribution

The paper presents UnmaskingTrees, a simple yet effective gradient-boosted decision tree method for tabular imputation and generation, and introduces BaltoBot for flexible conditional generation without parametric assumptions.

Findings

01 Leading performance on imputation benchmarks

02 State-of-the-art generation from training data with missingness

03 Competitive results on vanilla generation

Abstract

Despite much work on advanced deep learning and generative modeling techniques for tabular data generation and imputation, traditional methods have continued to win on imputation benchmarks. We herein present UnmaskingTrees, a simple method for tabular imputation (and generation) employing gradient-boosted decision trees which are used to incrementally unmask individual features. On a benchmark for out-of-the-box performance on 27 small tabular datasets, UnmaskingTrees offers leading performance on imputation; state-of-the-art performance on generation given data with missingness; and competitive performance on vanilla generation given data without missingness. To solve the conditional generation subproblem, we propose a tabular probabilistic prediction method, BaltoBot, which fits a balanced tree of boosted tree classifiers. Unlike older methods, it requires no parametric assumption on…
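The "balanced tree of boosted tree classifiers" mentioned in the abstract can be pictured with a small sketch. It is only an illustration under loud assumptions: the target is split at its median at every node, a k-nearest-neighbour vote stands in for each boosted-tree binary classifier, the input is a single numeric feature, and the names `MetaTreeNode`, `p_right`, and `sample` are hypothetical rather than taken from the paper's code.

```python
import random
from statistics import median


class MetaTreeNode:
    """One node of a balanced meta-tree over a numeric target y.

    Each internal node answers a binary question: is y at or above this
    node's median? Hypothetical stand-in, not the paper's implementation.
    """

    def __init__(self, xs, ys, depth):
        self.xs, self.ys = list(xs), list(ys)
        self.lo, self.hi = min(self.ys), max(self.ys)
        self.left = self.right = None
        if depth == 0:
            return
        self.split = median(self.ys)
        pairs = list(zip(self.xs, self.ys))
        below = [(x, y) for x, y in pairs if y < self.split]
        above = [(x, y) for x, y in pairs if y >= self.split]
        if below and above:  # only split when both halves are non-empty
            self.left = MetaTreeNode(*zip(*below), depth - 1)
            self.right = MetaTreeNode(*zip(*above), depth - 1)

    def p_right(self, x, k=5):
        # Assumption: a k-NN vote replaces the boosted-tree binary classifier.
        nearest = sorted(zip(self.xs, self.ys), key=lambda p: abs(p[0] - x))[:k]
        return sum(y >= self.split for _, y in nearest) / len(nearest)

    def sample(self, x, rng):
        if self.left is None:  # leaf: draw uniformly within the bin's range
            return rng.uniform(self.lo, self.hi)
        child = self.right if rng.random() < self.p_right(x) else self.left
        return child.sample(x, rng)


# Toy data: y = 10 * x, so the conditional distribution is easy to check.
xs = [i / 100 for i in range(200)]
ys = [10 * x for x in xs]
tree = MetaTreeNode(xs, ys, depth=3)  # 3 binary decisions -> 8 leaf bins
rng = random.Random(0)
low_draw = tree.sample(0.1, rng)   # lands in the lowest leaf bin
high_draw = tree.sample(1.9, rng)  # lands in the highest leaf bin
```

In this sketch a depth of d resolves 2^d quantile bins while each sample only passes through d binary decisions, and no distributional form is imposed on the target beyond the uniform draw inside a leaf.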

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The main idea is simple and sensible.
- The algorithm proposed for conditional density estimation over numerical variables, despite not being the main focus of the paper, shows promising results. I will note that while this algorithm might seem expensive, requiring training multiple binary classification models for predicting a single variable, it is in fact not so dissimilar to how a naive treatment as a multi-class categorical variable would be handled in XGBoost, requiring training one tre

Weaknesses

In my opinion this is a good paper with good ideas, but certain aspects could be improved (roughly in order of importance):
- The biggest downside of the proposed method would seem to be the need for duplicating the training set K*D times, which could make it impractical for large datasets (and require selecting a lower K lest the training set not fit in memory). The paper doesn't explore this regime and the trade-offs that would have to be made to scale to datasets with more examples and/o
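The K*D duplication the reviewer describes can be pictured with a small sketch. It assumes each row is copied K times, each copy gets its own random feature order, and one training example is emitted per unmasking step. The function name `build_unmasking_examples`, the `MASK` sentinel, and the toy rows are hypothetical and not taken from the paper's code.

```python
import random

MASK = None  # hypothetical stand-in for a masked (hidden) feature value


def build_unmasking_examples(rows, k_duplicates, seed=0):
    """Return (masked_input, target_index, target_value) training triples.

    For each row and each of k_duplicates random feature orderings, features
    are revealed one at a time; each step yields one example whose target is
    the next feature to unmask.
    """
    rng = random.Random(seed)
    examples = []
    for row in rows:
        n_features = len(row)
        for _ in range(k_duplicates):
            order = list(range(n_features))
            rng.shuffle(order)
            visible = [MASK] * n_features
            for j in order:
                examples.append((tuple(visible), j, row[j]))
                visible[j] = row[j]  # unmask feature j for later steps
    return examples


rows = [(5.1, 3.5, "a"), (4.9, 3.0, "b")]  # N = 2 rows, D = 3 features
examples = build_unmasking_examples(rows, k_duplicates=4)  # K = 4
print(len(examples))  # N * K * D = 2 * 4 * 3 = 24
```

Under these assumptions N rows expand to N*K*D training examples, which is the memory growth the review points at.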

Reviewer 02 · Rating 3 · Confidence 3

Strengths

- **Problem relevance**: There is a clear gap in the literature for methods that can handle tabular data imputations effectively, especially given that traditional methods like MissForest outperform newer deep learning approaches.
- **Computational efficiency**: The method offers faster training and inference compared to diffusion-based approaches, which is valuable for practical applications.
- **Novel combination**: While the individual components aren't new, combining gradient-boosted trees

Weaknesses

- **Incomplete technical analysis**: The paper fails to properly analyze how the method performs under different missingness mechanisms (MAR, MCAR, MNAR). This is a critical oversight for a paper focused on missing data.
- **Superficial approach**: Although the method is a novel combination of existing techniques, it also lacks substantial theoretical innovation or justification.
- **Oversold and unjustified claims**: The method is presented as a general solution, but its limitations and assum

Reviewer 03 · Rating 3 · Confidence 4

Strengths

- Simple to understand

Weaknesses

- Unclear how the proposed hierarchical handling of regression sampling is better than a simple discretized approach with well-chosen bins
- The idea is very simple and low-effort in terms of putting it to work EDIT: I want to take the above weakness back.
- The results are not great. The tables feel a little too designed: bolding non-standard things in the table, i.e. the winner of a subset of the methods or the 2 best methods. I think if one wants to make a scientific contribution at ICLR l

Code & Models

Repositories

calvinmccarter/unmasking-trees
pytorch · Official

Taxonomy

Topics: Data Mining Algorithms and Applications · Algorithms and Data Compression · Advanced Database Systems and Queries

Methods: Diffusion