On Learning Representations for Tabular Data Distillation
Inwon Kang, Parikshit Ram, Yi Zhou, Horst Samulowitz, Oshani Seneviratne

TL;DR
This paper introduces TDColER, a novel framework for tabular data distillation using column embeddings, and presents TDBench, a benchmark for evaluating such methods, demonstrating significant improvements in data quality across various models.
Contribution
The paper proposes TDColER, a new approach for tabular data distillation that addresses feature heterogeneity and non-differentiable models, and introduces TDBench, a comprehensive benchmark for evaluation.
Findings
TDColER improves data quality by up to 143% across models.
Evaluation on TDBench produced 226,890 distilled datasets and 548,880 models.
TDColER significantly improves on existing distillation schemes.
Abstract
Dataset distillation generates a small set of information-rich instances from a large dataset, resulting in reduced storage requirements, privacy or copyright risks, and computational costs for downstream modeling, though much of the research has focused on the image data modality. We study tabular data distillation, which brings in novel challenges such as the inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To mitigate these challenges, we present TDColER, a tabular data distillation framework via column embeddings-based representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, TDBench. Based on an elaborate evaluation on TDBench, resulting in 226,890 distilled datasets and 548,880…
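To make the setting in the abstract concrete, here is a minimal sketch of distilling a mixed-type table in a representation space and then fitting a non-differentiable model on the result. It is not TDColER: the fixed random embedding table only stands in for a learned column embedding, per-class k-means is one assumed way to pick distilled instances, and every name (`kmeans`, `distill_per_class`, `predict_1nn`, the synthetic data) is hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = X[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def distill_per_class(Z, y, per_class, seed=0):
    """Assumed distillation scheme: keep `per_class` centroids for each label."""
    Zs, ys = [], []
    for label in np.unique(y):
        Zs.append(kmeans(Z[y == label], per_class, seed=seed))
        ys.append(np.full(per_class, label))
    return np.vstack(Zs), np.concatenate(ys)

def predict_1nn(Z_train, y_train, Z_test):
    """Nearest-neighbor predictor, an example of a non-differentiable model."""
    dists = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[dists.argmin(axis=1)]

# Synthetic table: two numeric columns and one 3-level categorical column.
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)
num = rng.normal(size=(n, 2)) + np.where(y[:, None] == 1, 3.0, -3.0)
cat = rng.integers(0, 3, size=n)

# Stand-in for a learned column embedding: a fixed random vector per category.
embedding_table = rng.normal(size=(3, 2))
Z = np.hstack([num, embedding_table[cat]])  # shape (n, 4)

# Distill the 300 training rows down to 5 instances per class.
Z_small, y_small = distill_per_class(Z[:300], y[:300], per_class=5)

# Train the downstream model on the distilled set only; test on held-out rows.
acc = (predict_1nn(Z_small, y_small, Z[300:]) == y[300:]).mean()
print(Z_small.shape, acc)
```

Note that the distilled instances here live in the representation space rather than the original column space, so a downstream model has to consume the same encoding at test time.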
Taxonomy
Topics: Data Stream Mining Techniques · Machine Learning and Data Classification · Data Mining Algorithms and Applications
Methods: Sparse Evolutionary Training
