Soft-Label Dataset Distillation and Text Dataset Distillation

Ilia Sucholutsky; Matthias Schonlau

arXiv:1910.02551·cs.LG·June 10, 2022

TL;DR

This paper introduces a novel dataset distillation method that uses soft labels for images and extends to text data, achieving higher accuracy with fewer samples and enabling multi-class encoding per sample.

Contribution

It proposes a soft-label dataset distillation algorithm for images and texts, improving accuracy and reducing sample requirements compared to prior hard-label methods.

Findings

01. Soft labels improve accuracy by 2-4% on image tasks.

02. Fewer distilled images achieve high accuracy, e.g., 96% on MNIST with 10 images.

03. Text distillation outperforms existing methods, reaching near-original accuracy with fewer sentences.

Abstract

Dataset distillation is a method for reducing dataset sizes by learning a small number of synthetic samples containing all the information of a large dataset. This has several benefits, such as speeding up model training, reducing energy consumption, and reducing required storage space. Currently, each synthetic sample is assigned a single 'hard' label, and dataset distillation can only be used with image data. We propose to simultaneously distill both images and their labels, thus assigning each synthetic sample a 'soft' label (a distribution of labels). Our algorithm increases accuracy by 2-4% over the original algorithm for several image classification tasks. Using 'soft' labels also enables distilled datasets to consist of fewer samples than there are classes, as each sample can encode information for multiple classes. For example, training a LeNet model with 10…
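To make the hard-versus-soft label distinction concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the three-class setup, and the example logits and label values are all assumptions chosen for illustration. It shows only the loss side, namely that the cross-entropy gradient for a soft label carries a training signal for several classes at once, whereas a hard label pushes toward a single class.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_cross_entropy(logits, label):
    """Cross-entropy between a label distribution and the model's prediction.

    `label` may be one-hot (a 'hard' label) or any distribution that
    sums to 1 (a 'soft' label).
    """
    probs = softmax(logits)
    return -sum(q * math.log(p) for q, p in zip(label, probs) if q > 0)

def grad_wrt_logits(logits, label):
    """Gradient of the loss above w.r.t. the logits: softmax(logits) - label.

    This closed form holds when `label` sums to 1.
    """
    probs = softmax(logits)
    return [p - q for p, q in zip(probs, label)]

# Hypothetical model output for one synthetic sample (3 classes).
logits = [2.0, 0.5, -1.0]

hard_label = [1.0, 0.0, 0.0]   # all mass on class 0
soft_label = [0.6, 0.3, 0.1]   # illustrative distribution over classes

hard_grad = grad_wrt_logits(logits, hard_label)
soft_grad = grad_wrt_logits(logits, soft_label)

print("hard-label loss:", soft_cross_entropy(logits, hard_label))
print("soft-label loss:", soft_cross_entropy(logits, soft_label))
print("hard-label grad:", hard_grad)
print("soft-label grad:", soft_grad)
```

A negative gradient entry means a gradient-descent step raises that class's logit. With the hard label only class 0 is pushed up; with the soft label above, classes 1 and 2 are pushed up as well, which is the sense in which a single sample can encode information about multiple classes. In the distillation setting described in the abstract the soft labels would themselves be learned rather than fixed by hand as they are here.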

Taxonomy

Methods: Convolution · Dense Connections · LeNet