Unmasking Trees for Tabular Data
Calvin McCarter

TL;DR
UnmaskingTrees introduces a decision tree-based method for tabular data imputation and generation, achieving state-of-the-art results on benchmarks with efficiency and flexibility, and proposing a new probabilistic prediction approach.
Contribution
The paper presents UnmaskingTrees, a simple yet effective gradient-boosted decision tree method for tabular imputation and generation, and introduces BaltoBot for flexible conditional generation without parametric assumptions.
Findings
Leading performance on imputation benchmarks
State-of-the-art on data generation with missingness
Competitive results on vanilla generation
Abstract
Despite much work on advanced deep learning and generative modeling techniques for tabular data generation and imputation, traditional methods have continued to win on imputation benchmarks. We herein present UnmaskingTrees, a simple method for tabular imputation (and generation) employing gradient-boosted decision trees which are used to incrementally unmask individual features. On a benchmark for out-of-the-box performance on 27 small tabular datasets, UnmaskingTrees offers leading performance on imputation; state-of-the-art performance on generation given data with missingness; and competitive performance on vanilla generation given data without missingness. To solve the conditional generation subproblem, we propose a tabular probabilistic prediction method, BaltoBot, which fits a balanced tree of boosted tree classifiers. Unlike older methods, it requires no parametric assumption on…
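The incremental unmasking described in the abstract can be illustrated with a minimal sketch of how training examples might be built. This is not the author's implementation: the `MASK` marker, the function `build_unmasking_examples`, and the parameter `k_orderings` are hypothetical names, and the sketch assumes that each row is expanded under several random feature orderings, with one example produced per feature revealed.

```python
import random

MASK = None  # hypothetical marker for a feature that is still masked


def build_unmasking_examples(rows, k_orderings, seed=0):
    """Return (feature_index, partially_masked_row, target_value) triples.

    For each row and each of k_orderings random feature orders, start fully
    masked and reveal one feature at a time. Each reveal yields one example:
    predict the feature about to be revealed from what is currently visible.
    """
    rng = random.Random(seed)
    examples = []
    for row in rows:
        n_features = len(row)
        for _ in range(k_orderings):
            order = list(range(n_features))
            rng.shuffle(order)
            visible = [MASK] * n_features
            for j in order:
                # Feature j is still masked in the input at this step.
                examples.append((j, list(visible), row[j]))
                visible[j] = row[j]  # unmask feature j for later steps
    return examples


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
examples = build_unmasking_examples(rows, k_orderings=2)
```

Under these assumptions, N rows with D features and K orderings yield N * K * D examples, which would then be grouped by `feature_index` and passed to one gradient-boosted model per feature.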
Peer Reviews
Decision·Submitted to ICLR 2025
- The main idea is simple and sensible.
- The algorithm proposed for conditional density estimation over numerical variables, despite not being the main focus of the paper, shows promising results. I will note that while this algorithm might seem expensive, requiring training multiple binary classification models for predicting a single variable, it is in fact not so dissimilar to how a naive treatment as a multi-class categorical variable would be handled in XGBoost, requiring training one tre
In my opinion this is a good paper with good ideas but certain aspects could be improved (roughly in order of importance):
- The biggest downside of the proposed method would seem to be the need for duplicating the training set K*D times, which could make it impractical for large datasets (and require selecting a lower K lest the training set not fit in memory). The paper doesn't explore this regime and the trade-offs that would have to be made to scale to datasets with more examples and/o
- **Problem relevance**: There is a clear gap in the literature for methods that can handle tabular data imputations effectively, especially given that traditional methods like MissForest outperform newer deep learning approaches.
- **Computational efficiency**: The method offers faster training and inference compared to diffusion-based approaches, which is valuable for practical applications.
- **Novel combination**: While the individual components aren't new, combining gradient-boosted trees
- **Incomplete technical analysis**: The paper fails to properly analyze how the method performs under different missingness mechanisms (MAR, MCAR, MNAR). This is a critical oversight for a paper focused on missing data.
- **Superficial approach**: Although the method is a novel combination of existing techniques, it also lacks substantial theoretical innovation or justification.
- **Oversold and unjustified claims**: The method is presented as a general solution, but its limitations and assum
- Simple to understand
- Unclear how the proposed hierarchical handling of regression sampling is better than a simple discretized approach with well-chosen bins.
- The idea is very simple and low-effort in terms of putting it to work. EDIT: I want to take the above weakness back.
- The results are not great. The tables feel a little too designed: bolding non-standard things in the table, i.e. the winner of a subset of the methods or the 2 best methods. I think if one wants to make a scientific contribution at ICLR l
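To make the hierarchical sampling discussed in the reviews more concrete, here is a rough sketch of a balanced tree of binary classifiers for one numeric target. It is an illustration under stated assumptions, not BaltoBot itself: splitting at the median, sampling uniformly inside a leaf's range, and the names `ConstantClassifier`, `fit_node`, and `sample` are all hypothetical. The stand-in classifier ignores the features entirely; a real version would use a boosted-tree classifier at each node.

```python
import random


class ConstantClassifier:
    """Hypothetical stand-in for a boosted-tree classifier; ignores x."""

    def fit(self, xs, labels):
        self.p_right = sum(labels) / len(labels)
        return self

    def predict_proba_right(self, x):
        return self.p_right


def fit_node(xs, ys, depth, make_clf=ConstantClassifier):
    """Recursively split targets at the median, one classifier per node."""
    leaf = {"leaf": True, "lo": min(ys), "hi": max(ys)}
    if depth == 0 or leaf["lo"] == leaf["hi"]:
        return leaf
    split = sorted(ys)[len(ys) // 2]  # assumed median split keeps it balanced
    labels = [1 if y >= split else 0 for y in ys]
    if sum(labels) == len(labels):  # heavy ties: nothing falls to the left
        return leaf
    left = [(x, y) for x, y in zip(xs, ys) if y < split]
    right = [(x, y) for x, y in zip(xs, ys) if y >= split]
    return {
        "leaf": False,
        "clf": make_clf().fit(xs, labels),
        "left": fit_node([x for x, _ in left], [y for _, y in left],
                         depth - 1, make_clf),
        "right": fit_node([x for x, _ in right], [y for _, y in right],
                          depth - 1, make_clf),
    }


def sample(node, x, rng):
    """Descend by sampling left/right at each node, then draw within the leaf."""
    while not node["leaf"]:
        go_right = rng.random() < node["clf"].predict_proba_right(x)
        node = node["right"] if go_right else node["left"]
    return rng.uniform(node["lo"], node["hi"])


xs = [[0.0]] * 8
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
tree = fit_node(xs, ys, depth=2)
rng = random.Random(0)
draws = [sample(tree, [0.0], rng) for _ in range(200)]
```

A tree of depth d needs 2^d - 1 binary classifiers but only d classifier evaluations per sample, whereas a flat discretization with 2^d bins would be handled as a single multi-class problem.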
Taxonomy
Topics: Data Mining Algorithms and Applications · Algorithms and Data Compression · Advanced Database Systems and Queries
Methods: Diffusion
