Dataset Distillation from First Principles: Integrating Core Information Extraction and Purposeful Learning

Vyacheslav Kungurtsev; Yuanfang Peng; Jianyang Gu; Saeed Vahidian; Anthony Quinn; Fadwa Idlahcen; Yiran Chen

arXiv:2409.01410·cs.LG·September 4, 2024

Open Access

TL;DR

This paper formalizes dataset distillation (DD) as an optimization problem tied to specific inference tasks, revealing its broad applications and limitations, and demonstrating its potential in medical data merging and physics-informed neural networks.

Contribution

It introduces a formal, task-specific model of DD, enabling better understanding and development of DD methods across diverse applications.

Findings

1. Formal model of DD tied to inference tasks
2. Analysis of DD methods' strengths and limitations
3. Numerical case studies in medical data and physics-informed neural networks

Abstract

Dataset distillation (DD) is an increasingly important technique that focuses on constructing a synthetic dataset capable of capturing the core information in training data to achieve comparable performance in models trained on the latter. While DD has a wide range of applications, the theory supporting it is less well evolved. New methods of DD are compared on a common set of benchmarks, rather than oriented towards any particular learning task. In this work, we present a formal model of DD, arguing that a precise characterization of the underlying optimization problem must specify the inference task associated with the application of interest. Without this task-specific focus, the DD problem is under-specified, and the selection of a DD algorithm for a particular task is merely heuristic. Our formalization reveals novel applications of DD across different modeling environments. We…

Taxonomy

Topics: Neural Networks and Applications · Machine Learning and Data Classification

Methods: Sparse Evolutionary Training