Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation
Shaobo Wang, Yantai Yang, Qilong Wang, Kaixin Li, Linfeng Zhang, and Junchi Yan

TL;DR
This paper investigates the role of sample difficulty in dataset distillation, revealing that focusing on easier samples improves distilled dataset quality, and introduces a correction method that enhances various distillation techniques.
Contribution
It offers an initial theoretical and empirical analysis of sample difficulty in matching-based dataset distillation and proposes a simple, effective correction method to improve distillation outcomes.
Findings
Prioritizing easier samples improves distilled dataset quality, especially in low IPC (image-per-class) settings.
Sample Difficulty Correction (SDC) enhances multiple distillation methods.
The neural scaling laws of data pruning are extended to dataset distillation to explain matching-based methods theoretically.
Abstract
Dataset Distillation (DD) aims to synthesize a small dataset capable of performing comparably to the original dataset. Despite the success of numerous DD methods, theoretical exploration of this area remains unaddressed. In this paper, we take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty. We begin by empirically examining sample difficulty, measured by gradient norm, and observe that different matching-based methods roughly correspond to specific difficulty tendencies. We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods. Our findings suggest that prioritizing the synthesis of easier samples from the original dataset can enhance the quality of distilled datasets, especially in low IPC (image-per-class) settings. Based on our empirical observations and…
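The abstract measures sample difficulty by gradient norm. As a rough illustration only, the sketch below scores samples by the norm of the per-sample cross-entropy gradient for a toy linear softmax classifier, and treats a smaller norm as "easier". The function name `per_sample_grad_norm`, the linear model, and the random data are assumptions made for this example; they are not taken from the paper, which works with deep networks and matching-based distillation pipelines.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def per_sample_grad_norm(W, X, y):
    """Norm of the cross-entropy gradient w.r.t. W, one value per sample.

    Hypothetical helper for a linear model with logits = X @ W.T.
    The per-sample gradient is the outer product (p - onehot(y)) x^T,
    whose Frobenius norm factorizes as ||p - onehot(y)|| * ||x||.
    """
    probs = softmax(X @ W.T)              # shape (n_samples, n_classes)
    onehot = np.eye(W.shape[0])[y]        # shape (n_samples, n_classes)
    residual = probs - onehot
    return np.linalg.norm(residual, axis=1) * np.linalg.norm(X, axis=1)

# Toy data: 8 samples, 5 features, 3 classes (all made up for illustration).
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
X = rng.normal(size=(8, 5))
y = rng.integers(0, 3, size=8)

scores = per_sample_grad_norm(W, X, y)
easiest_first = np.argsort(scores)        # indices ordered from easiest to hardest
```

Under this assumed scoring, `easiest_first` gives one possible ordering that a difficulty-aware method could use to favor easier samples; how the paper's correction method actually weights or selects samples is not described in the text above.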
Taxonomy
Topics: Machine Learning and Data Classification
Methods: Pruning
