On Learning Representations for Tabular Data Distillation

Inwon Kang; Parikshit Ram; Yi Zhou; Horst Samulowitz; Oshani Seneviratne

arXiv:2501.13905 · cs.LG · January 24, 2025

Open Access

TL;DR

This paper introduces TDColER, a novel framework for tabular data distillation using column embeddings, and presents TDBench, a benchmark for evaluating such methods, demonstrating significant improvements in data quality across various models.

Contribution

The paper proposes TDColER, a new approach for tabular data distillation that addresses feature heterogeneity and non-differentiable models, and introduces TDBench, a comprehensive benchmark for evaluation.

Findings

1. TDColER improves the quality of distilled data by up to 143% across models.
2. The evaluation on TDBench produced 226,890 distilled datasets and 548,880 models.
3. TDColER significantly improves on existing distillation schemes.

Abstract

Dataset distillation generates a small set of information-rich instances from a large dataset, reducing storage requirements, privacy or copyright risks, and computational costs for downstream modeling. Much of the research, however, has focused on the image data modality. We study tabular data distillation, which brings novel challenges such as inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To mitigate these challenges, we present TDColER, a tabular data distillation framework based on representation learning with column embeddings. To evaluate this framework, we also present a tabular data distillation benchmark, TDBench. Based on an elaborate evaluation on TDBench, resulting in 226,890 distilled datasets and 548,880…

Taxonomy

Topics: Data Stream Mining Techniques · Machine Learning and Data Classification · Data Mining Algorithms and Applications

Methods: Sparse Evolutionary Training