Masked Language Modeling Becomes Conditional Density Estimation for Tabular Data Synthesis
Seunghwan An, Gyeongdong Woo, Jaesung Lim, ChangHyun Kim, Sungchul Hong, Jong-June Jeon

TL;DR
This paper introduces MaCoDE, a novel method that transforms masked language modeling into a conditional density estimation technique for generating high-utility synthetic tabular data, effectively handling mixed data types and missing values.
Contribution
MaCoDE redefines MLM as histogram-based conditional density estimation, bridging distributional learning and MLM, and enabling flexible, privacy-aware synthetic data generation for tabular datasets.
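One standard way to see why a classification loss can control a distributional distance is sketched below. This is a generic argument via Pinsker's inequality, not necessarily the proof used in the paper; the symbols $p$ (true conditional distribution over bins), $q_\theta$ (model prediction), and $K$ (number of bins) are notation assumed here for illustration.

```latex
% Expected cross-entropy splits into an entropy term that does not depend on
% the model and a KL term that does:
\mathbb{E}_{x}\Big[-\sum_{k=1}^{K} p(k \mid x)\,\log q_\theta(k \mid x)\Big]
  = \mathbb{E}_{x}\big[H\big(p(\cdot \mid x)\big)\big]
  + \mathbb{E}_{x}\big[\mathrm{KL}\big(p(\cdot \mid x)\,\|\,q_\theta(\cdot \mid x)\big)\big].

% Pinsker's inequality bounds total variation by the KL divergence:
\mathrm{TV}\big(p(\cdot \mid x),\, q_\theta(\cdot \mid x)\big)
  \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(p(\cdot \mid x)\,\|\,q_\theta(\cdot \mid x)\big)}.
```

Under these assumptions, driving the cross-entropy toward its minimum shrinks the KL term, which in turn shrinks the upper bound on the total variation distance between the true and estimated conditional distributions.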
Findings
Effective in generating high-utility synthetic data across 10 datasets.
Capable of adjusting privacy levels without re-training.
Handles missing data and imputations effectively.
Abstract
In this paper, our goal is to generate synthetic data for heterogeneous (mixed-type) tabular datasets with high machine learning utility (MLu). Since the MLu performance depends on accurately approximating the conditional distributions, we focus on devising a synthetic data generation method based on conditional distribution estimation. We introduce MaCoDE by redefining the consecutive multi-class classification task of Masked Language Modeling (MLM) as histogram-based non-parametric conditional density estimation. Our approach enables the estimation of conditional densities across arbitrary combinations of target and conditional variables. We bridge the theoretical gap between distributional learning and MLM by demonstrating that minimizing the orderless multi-class classification loss leads to minimizing the total variation distance between conditional distributions. To validate our…
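The idea of histogram-based conditional density estimation can be illustrated with a small sketch. It assumes continuous columns are discretized into quantile bins and that cells are masked at random, as in MLM. The function names, the `MASK` sentinel, the bin count, and the toy data are all hypothetical, and a simple frequency count stands in for the trained classifier, so this is not the authors' implementation.

```python
import numpy as np

MASK = -1  # hypothetical sentinel marking a masked cell


def quantile_bin(column, n_bins):
    """Map a continuous column to integer bin ids in 0..n_bins-1 via empirical quantiles."""
    edges = np.quantile(column, np.linspace(0.0, 1.0, n_bins + 1))
    # Use interior edges only so every value lands in 0..n_bins-1.
    return np.searchsorted(edges[1:-1], column, side="right")


def random_mask(tokens, rng, mask_prob=0.3):
    """Replace a random subset of cells with MASK; return the masked copy and the boolean mask."""
    mask = rng.random(tokens.shape) < mask_prob
    return np.where(mask, MASK, tokens), mask


def conditional_histogram(tokens, target_col, cond_col, cond_bin, n_bins):
    """Empirical P(target bin | cond_col == cond_bin) as a normalized histogram.

    A counting estimate used here in place of a learned classifier's softmax output.
    """
    rows = tokens[:, cond_col] == cond_bin
    counts = np.bincount(tokens[rows, target_col], minlength=n_bins)
    return counts / counts.sum()


# Toy data (assumed for illustration): two positively correlated continuous columns.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 2.0 * x + rng.normal(scale=0.5, size=2000)

n_bins = 10
tokens = np.column_stack([quantile_bin(x, n_bins), quantile_bin(y, n_bins)])

# Masked cells are what a classifier would be trained to predict from the visible ones.
masked, mask = random_mask(tokens, rng)

# Histogram estimate of the density of y's bin given that x falls in its lowest bin.
hist = conditional_histogram(tokens, target_col=1, cond_col=0, cond_bin=0, n_bins=n_bins)
```

Here each bin id plays the role of a vocabulary token, and predicting a masked cell is a multi-class classification over bins. The predicted class probabilities can then be read as a histogram estimate of the conditional density, for any choice of target and conditioning columns.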
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
