On the Size and Approximation Error of Distilled Sets

Alaa Maalouf; Murad Tukan; Noel Loo; Ramin Hasani; Mathias Lechner; Daniela Rus

arXiv:2305.14113·cs.LG·May 24, 2023·1 cites

Open Access

TL;DR

This paper provides a theoretical analysis of dataset distillation for kernel ridge regression, proving the existence of small distilled datasets with quantifiable excess risk, and establishing bounds on their approximation error.

Contribution

The paper gives the first theoretical guarantees on the size and approximation error of distilled datasets for kernel ridge regression, obtained by working in random Fourier feature space.

Findings

01

Small distilled datasets exist with size linear in RFF dimension or effective degrees of freedom.

02

The excess risk of a distilled dataset is bounded, and the bound depends on the ridge regularization parameter.

03

Empirical verification supports the theoretical bounds.

Abstract

Dataset Distillation is the task of synthesizing small datasets from large ones while still retaining comparable predictive accuracy to the original uncompressed dataset. Despite significant empirical progress in recent years, there is little understanding of the theoretical limitations/guarantees of dataset distillation, specifically, what excess risk is achieved by distillation compared to the original dataset, and how large are distilled datasets? In this work, we take a theoretical view on kernel ridge regression (KRR) based methods of dataset distillation such as Kernel Inducing Points. By transforming ridge regression in random Fourier features (RFF) space, we provide the first proof of the existence of small (size) distilled datasets and their corresponding excess risk for shift-invariant kernels. We prove that a small set of instances exists in the original input space such that…
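The abstract's central move, recasting kernel ridge regression as ordinary ridge regression in RFF space, can be illustrated with a minimal sketch. This is not the paper's distillation construction; it only shows, under assumed settings, that ridge regression on random Fourier features approximates KRR with a Gaussian (shift-invariant) kernel. The data, the values of `n`, `d`, `D`, `gamma`, `lam`, and the helper names `rff_features` and `gaussian_kernel` are all hypothetical choices made for illustration.

```python
import numpy as np

def rff_features(X, W, b):
    """Random Fourier feature map z(x) = sqrt(2/D) * cos(W x + b)."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

def gaussian_kernel(A, B, gamma):
    """Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

# Illustrative (assumed) problem sizes and hyperparameters.
rng = np.random.default_rng(0)
n, d, D = 200, 3, 2000
gamma, lam = 0.5, 1e-2

# Synthetic regression data, made up for this sketch.
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# For exp(-gamma * ||x - y||^2), frequencies are drawn from N(0, 2*gamma*I).
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

# Ridge regression in RFF space (primal form).
Z = rff_features(X, W, b)
w = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ y)
pred_rff = Z @ w

# Exact kernel ridge regression (dual form) with the same regularizer.
K = gaussian_kernel(X, X, gamma)
alpha = np.linalg.solve(K + lam * np.eye(n), y)
pred_krr = K @ alpha

kernel_err = np.max(np.abs(Z @ Z.T - K))   # how well Z Z^T approximates K
gap = np.max(np.abs(pred_rff - pred_krr))  # difference in fitted values
print(f"max kernel error: {kernel_err:.3f}, max prediction gap: {gap:.3f}")
```

The primal solve costs on the order of D^3 rather than n^3, so the RFF dimension `D`, not the number of training points, controls the size of the problem. That is the quantity the paper's size bounds are stated in terms of.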

Taxonomy

Topics: Machine Learning and Data Classification · Gaussian Processes and Bayesian Inference · Sparse and Compressive Sensing Techniques