Utility Boundary of Dataset Distillation: Scaling and Configuration-Coverage Laws
Zhengquan Luo, Zhiqiang Xu

TL;DR
This paper introduces a unified theoretical framework for dataset distillation, establishing laws that describe how distilled dataset size relates to performance, configuration diversity, and method interchangeability, supported by empirical validation.
Contribution
It provides the first comprehensive theoretical analysis of dataset distillation, including scaling and coverage laws, and clarifies the interchangeability of different matching methods.
Findings
Error decreases with larger distilled datasets following a scaling law.
Required sample size scales linearly with configuration diversity.
Matching methods are interchangeable surrogates reducing generalization error.
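To make the first two findings concrete, here is a minimal sketch of how such laws could be checked by curve fitting. Everything in it is an assumption for illustration: the summary above does not state the paper's functional forms, so the power-law-with-floor shape `err ≈ floor + a * n**(-b)`, the use of `log M` as the diversity measure, the helper names `fit_power_law` and `fit_linear`, and all numbers (synthetic and noise-free) are hypothetical and are not results from the paper.

```python
import numpy as np


def fit_power_law(n, err, floor=0.0):
    """Fit err ~ floor + a * n**(-b) by least squares in log-log space.

    The functional form and the known `floor` are assumptions of this sketch.
    Returns the pair (a, b).
    """
    n = np.asarray(n, dtype=float)
    excess = np.asarray(err, dtype=float) - floor
    if np.any(excess <= 0):
        raise ValueError("every error value must exceed the assumed floor")
    # log(excess) = log(a) - b * log(n), so a straight-line fit recovers a and b.
    slope, intercept = np.polyfit(np.log(n), np.log(excess), 1)
    return float(np.exp(intercept)), float(-slope)


def fit_linear(x, y):
    """Fit y ~ slope * x + intercept by ordinary least squares."""
    slope, intercept = np.polyfit(
        np.asarray(x, dtype=float), np.asarray(y, dtype=float), 1
    )
    return float(slope), float(intercept)


# Synthetic scaling data: error versus distilled-set size (images per class).
ipc = np.array([1.0, 5.0, 10.0, 50.0, 100.0])
err = 0.10 + 0.40 * ipc ** -0.5  # hypothetical ground truth, no noise
a, b = fit_power_law(ipc, err, floor=0.10)

# Synthetic coverage data: required sample size versus configuration diversity,
# with diversity measured as log(M) for M training configurations (assumed proxy).
num_configs = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
diversity = np.log(num_configs)
required = 10.0 + 25.0 * diversity  # hypothetical ground truth, no noise
slope, intercept = fit_linear(diversity, required)

print(f"scaling fit:  a={a:.3f}, b={b:.3f}")
print(f"coverage fit: slope={slope:.3f}, intercept={intercept:.3f}")
```

With real measurements the floor would not be known in advance and would have to be estimated as well, for example with a nonlinear least-squares fit; the log-log trick is used here only to keep the sketch dependency-free beyond NumPy.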
Abstract
Dataset distillation (DD) aims to construct compact synthetic datasets that allow models to achieve comparable performance to full-data training while substantially reducing storage and computation. Despite rapid empirical progress, its theoretical foundations remain limited: existing methods (gradient, distribution, trajectory matching) are built on heterogeneous surrogate objectives and optimization assumptions, which makes it difficult to analyze their common principles or provide general guarantees. Moreover, it is still unclear under what conditions distilled data can retain the effectiveness of full datasets when the training configuration, such as optimizer, architecture, or augmentation, changes. To answer these questions, we propose a unified theoretical framework, termed configuration–dynamics–error analysis, which reformulates major DD approaches under a common…
Peer Reviews
Decision · Submitted to ICLR 2026
1. Analyzing existing dataset distillation algorithms within a unified theoretical framework is an important problem, since many of these algorithms lack strong theoretical foundations. 2. The paper offers a large amount of theoretical derivation (44 of its 51 pages).
1. The key formulation, the unified form of stability summarized by equation 6, is not well justified. Why is the absolute difference in the expected risk approximately upper bounded by optimization residual + statistical fluctuations + matching term? The paper tries to motivate them from stability and information-theoretic approaches to generalization, but none of the three cited works is relevant. The notion of mutual information is not mentioned anywhere in the prior text. 2. The quality and
Strengths: 1. **Theoretical novelty and unification**: The paper offers the first generalization-error-based framework that unifies disparate DD approaches under a common lens. This is a significant conceptual advance over prior paradigm-specific analyses. 2. **Clear and impactful theoretical results**: The scaling law explains the widely observed IPC saturation phenomenon, while the coverage law formally defines the “utility boundary” of distilled data across configuration shifts, a pra
**Weaknesses:** 1. **Limited experimental scale**: All experiments are conducted on relatively small-scale vision datasets (MNIST, CIFAR, ImageNette). The absence of evaluation on larger, more realistic benchmarks (e.g., ImageNet-1K or language datasets) raises concerns about the practical relevance and scalability of the derived laws in modern settings. 2. **Proxy for configuration diversity**: The paper approximates coverage complexity Hcov(A, r) by log M (M = number of configurations
1. The paper presents a unified bi-level generalization error framework that connects gradient, trajectory, and distribution-based DD, providing an important step toward a unified theoretical foundation for DD. 2. The proposed scaling and coverage laws formalize intuitive empirical observations into mathematically grounded relationships.
1. The framework relies on PL conditions and Lipschitz continuity. While these assumptions are standard in convergence analysis, they may not strictly hold for modern deep networks with non-smooth activations, normalization layers, and stochastic training components. The practical relevance of the theoretical results could be further clarified by discussing their validity under relaxed or empirically realistic assumptions. 2. The validation of the proposed laws relies mainly on curve-fitting wit
Taxonomy
Topics: Machine Learning and Data Classification · Time Series Analysis and Forecasting · Stochastic Gradient Optimization Techniques
