Dataset Distillation Meets Provable Subset Selection

Murad Tukan; Alaa Maalouf; Margarita Osadchy

arXiv:2307.08086 · cs.LG · July 18, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a provable, importance-based subset selection method to improve dataset distillation, reducing data redundancy and enhancing synthetic dataset quality for deep learning models.

Contribution

It presents a novel, theoretically grounded approach for initializing and training distilled datasets by identifying important data points, merging subset selection with distillation.

Findings

01. Improved dataset distillation performance on benchmark tasks.

02. Effective identification of important data points reduces redundancy.

03. Enhanced synthetic datasets maintain accuracy with fewer data samples.

Abstract

Deep learning has grown tremendously over recent years, yielding state-of-the-art results in various fields. However, training such models requires huge amounts of data, increasing the computational time and cost. To address this, dataset distillation was proposed to compress a large training dataset into a smaller synthetic one that retains its performance -- this is usually done by (1) uniformly initializing a synthetic set and (2) iteratively updating/learning this set according to a predefined loss by uniformly sampling instances from the full data. In this paper, we improve both phases of dataset distillation: (1) we present a provable, sampling-based approach for initializing the distilled set by identifying important and removing redundant points in the data, and (2) we further merge the idea of data subset selection with dataset distillation, by training the distilled set on…
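
To make phase (1) of the abstract concrete, here is a minimal sketch of importance-based initialization of a distilled set. The abstract does not state how importance is computed, so the score used here (a uniform term plus each point's share of the squared distance to its class mean) is an assumed stand-in, not the paper's method. The function names `importance_scores`, `sample_subset`, and `init_distilled_set` are hypothetical.

```python
import numpy as np


def importance_scores(points):
    """Assumed importance proxy (not taken from the paper).

    Mixes a uniform term with each point's share of the total squared
    distance to the mean, so outlying points are sampled more often.
    The returned scores are non-negative and sum to 1.
    """
    n = points.shape[0]
    center = points.mean(axis=0)
    sq_dist = ((points - center) ** 2).sum(axis=1)
    total = sq_dist.sum()
    if total == 0.0:
        # All points coincide: fall back to uniform probabilities.
        return np.full(n, 1.0 / n)
    return 0.5 / n + 0.5 * sq_dist / total


def sample_subset(points, m, rng):
    """Draw m indices with probability proportional to importance.

    Also returns inverse-probability weights, which keep weighted sums
    over the sample unbiased estimates of sums over the full data.
    """
    probs = importance_scores(points)
    idx = rng.choice(points.shape[0], size=m, replace=True, p=probs)
    weights = 1.0 / (m * probs[idx])
    return idx, weights


def init_distilled_set(features, labels, per_class, seed=0):
    """Initialize a distilled set by importance sampling within each class."""
    rng = np.random.default_rng(seed)
    chosen = []
    for c in np.unique(labels):
        class_idx = np.flatnonzero(labels == c)
        local_idx, _ = sample_subset(features[class_idx], per_class, rng)
        chosen.append(class_idx[local_idx])
    chosen = np.concatenate(chosen)
    return features[chosen].copy(), labels[chosen].copy()
```

Under the same assumptions, `sample_subset` could also replace the uniform mini-batch draw in phase (2), with the returned weights applied to the per-instance loss terms. Whether the paper weights its loss this way is not stated in the text above.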


Taxonomy

Topics: Machine Learning and Data Classification · Anomaly Detection Techniques and Applications · Domain Adaptation and Few-Shot Learning