Dataset Distillation as Pushforward Optimal Quantization

Hong Ye Tan; Emma Slade

arXiv:2501.07681 · cs.LG · February 9, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a dataset distillation method based on optimal quantization. It links disentangled distillation to a classical mathematical problem and reports improved performance and scalability on image datasets such as ImageNet-1K.

Contribution

It reformulates disentangled dataset distillation as an optimal quantization problem, achieving state-of-the-art results with less computational effort.

Findings

01. Better performance than previous methods on ImageNet-1K.

02. Achieves state-of-the-art results with minimal additional computation.

03. Outperforms diffusion guidance methods in distillation tasks.

Abstract

Dataset distillation aims to find a synthetic training set such that training on the synthetic data achieves similar performance to training on real data, with orders of magnitude lower computational requirements. Existing methods can be broadly categorized as either bi-level optimization problems that have neural network training heuristics as the lower-level problem, or disentangled methods that bypass the bi-level optimization by matching distributions of data. The latter approach has the major advantages of speed and scalability in terms of the size of both training and distilled datasets. We demonstrate that when equipped with an encoder-decoder structure, the empirically successful disentangled methods can be reformulated as an optimal quantization problem, where a finite set of points is found to approximate the underlying probability measure by minimizing the expected projection…
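To make the quantization idea concrete, the sketch below approximates a point cloud by a small set of weighted points using Lloyd's algorithm (the standard k-means heuristic), which locally minimizes the mean squared distance from each point to its nearest centroid. This is an illustration only, not the authors' implementation: the paper works in the latent space of an encoder-decoder, whereas this toy example uses 2-D points, and the names `lloyd_quantize`, `sq_dist`, `nearest`, and `demo_points` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def lloyd_quantize(points, k, n_iter=50, seed=0):
    """Approximate `points` by k centroids with non-uniform weights.

    Assumption: plain Lloyd iterations on raw points; the paper's
    latent-space pipeline is not reproduced here.
    """
    rng = random.Random(seed)
    # Initialise the centroids from k distinct data points.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        # Assignment step: group each point with its nearest centroid.
        cells = [[] for _ in range(k)]
        for p in points:
            cells[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cell.
        for j, cell in enumerate(cells):
            if cell:  # keep the old centroid if a cell is empty
                centroids[j] = tuple(sum(c) / len(cell) for c in zip(*cell))
    # Weights are the fraction of points in each cell; the distortion is
    # the mean squared distance to the nearest centroid.
    counts = [0] * k
    total = 0.0
    for p in points:
        j = nearest(p, centroids)
        counts[j] += 1
        total += sq_dist(p, centroids[j])
    weights = [c / len(points) for c in counts]
    return centroids, weights, total / len(points)


# Toy data: 30 points near (0, 0) and 10 points near (10, 10).
demo_points = [(0.01 * i, 0.0) for i in range(30)]
demo_points += [(10.0 + 0.01 * i, 10.0) for i in range(10)]
centroids, weights, distortion = lloyd_quantize(demo_points, k=2)
```

On this toy data the larger cluster receives the larger weight, which is the difference between a weighted quantizer and a uniformly weighted set of cluster centers.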

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- **Theoretical contribution**: The key theoretical contribution is the formal link between quantization theory, Wasserstein distance, and dataset distillation consistency. Theorem 1 demonstrates that score-based diffusion preserves the distributional closeness between the raw data and its quantized latent approximation, resulting in consistent gradient expectations during training on the distilled dataset. Corollary 1 further establishes asymptotic convergence rates $\mathcal{O}(K^{-1/d})$ for

Weaknesses

- **Heuristic implementation > theoretical analysis**: The core theoretical argument is that optimal quantization (non-uniform weights) is superior to finding the Wasserstein barycenter (uniform weights). However, the final implementation does not utilize these theoretically derived weights. Instead, it employs an ad-hoc heuristic (Eq. 34) for "variance reduction." This choice is not justified by the theory and is not ablated. This undermines the central claim that the theory guides the m

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Frames the dataset distillation problem as an optimal quantization problem and supports the suggested improvements with appropriate theoretical proofs. The theoretical formalization connects optimal quantization error with downstream expectation differences in the image domain, which helps bridge diffusion-based priors and representative selection in a mathematically grounded way.
2. Theorem 1 gives an explicit upper bound for the Wasserstein distance in latent space, which they show is rela

Weaknesses

1. The theoretical foundation simply builds upon existing proofs from quantization theory, so the novelty in theoretical contributions is limited. The actual improvements suggested are limited and incremental, as operationally, it simply adds weights to cluster centers.
2. A weighted loss is used while training the student models, which makes it unclear if the improvements are from better diffusion prior selection via clustering or better optimization of student model via weighted loss. An abla

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper introduces a new conceptual connection between dataset distillation and optimal quantization under the Wasserstein distance framework.
2. The authors propose to use the Wasserstein distance between the distilled latent representations and the original latent data distribution as a quantitative indicator of how well the synthetic dataset approximates the real data distribution.
3. The proposed method, DDOQ, improves performance in higher images-per-class (IPC) settings.

Weaknesses

1. The introduction looks like a related work and fails to clearly introduce the research gap and specific contributions of the paper. As a result, it is difficult for readers to understand what is novel about this work.
2. The proposed DDOQ method is a modification of D4M, replacing uniform clustering with a weighted k-means step. The overall pipeline (latent clustering, diffusion decoding, and weighted training) remains conceptually similar to previous methods. The framing of the approach as “

Taxonomy

Topics: Neural Networks and Applications · Advanced Data Compression Techniques · CCD and CMOS Imaging Sensors

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sparse Evolutionary Training