TL;DR
This paper introduces a learnability-guided diffusion method for dataset distillation that incrementally constructs synthetic datasets, reducing redundancy and improving performance on image classification benchmarks.
Contribution
It proposes a novel curriculum-based approach using learnability scores and diffusion models to generate more effective, less redundant synthetic datasets for training machine learning models.
Findings
Reduces dataset redundancy by 39.1%.
Achieves a state-of-the-art 60.1% on ImageNet-1K.
Promotes specialization across training stages.
Abstract
Training machine learning models on massive datasets is expensive and time-consuming. Dataset distillation addresses this by creating a small synthetic dataset that achieves the same performance as the full dataset. Recent methods use diffusion models to generate distilled data, either by promoting diversity or matching training gradients. However, existing approaches produce redundant training signals, where samples convey overlapping information. Empirically, disjoint subsets of distilled datasets capture 80-90% overlapping signals. This redundancy stems from optimizing visual diversity or average training dynamics without accounting for similarity across samples, leading to datasets where multiple samples share similar information rather than complementary knowledge. We propose learnability-driven dataset distillation, which constructs synthetic datasets incrementally through…
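The incremental, redundancy-aware construction described in the abstract can be illustrated with a small sketch. The abstract is truncated before it defines the method, so everything below is an assumption made for illustration, not the paper's algorithm: the score (current-model loss minus reference-model loss), the fixed candidate pool standing in for diffusion-generated samples, and the names `learnability_scores`, `top_k`, `build_incrementally`, and `toy_loss` are all hypothetical.

```python
from typing import Callable, List, Sequence, Set


def learnability_scores(current_losses: Sequence[float],
                        reference_losses: Sequence[float]) -> List[float]:
    """Assumed score: current-model loss minus reference-model loss.

    High when the model trained so far still fails on a sample that a
    reference model handles well, i.e. the sample adds new, learnable signal.
    """
    if len(current_losses) != len(reference_losses):
        raise ValueError("loss lists must have the same length")
    return [c - r for c, r in zip(current_losses, reference_losses)]


def top_k(scores: Sequence[float], k: int, exclude: Set[int]) -> List[int]:
    """Indices of the k highest-scoring candidates not already selected."""
    remaining = [i for i in range(len(scores)) if i not in exclude]
    # Python's sort is stable, so ties keep their original index order.
    remaining.sort(key=lambda i: scores[i], reverse=True)
    return remaining[:k]


def build_incrementally(reference_losses: Sequence[float],
                        loss_fn: Callable[[List[int]], List[float]],
                        stages: int,
                        per_stage: int) -> List[int]:
    """Grow the selected set stage by stage, re-scoring after each stage.

    loss_fn(selected) must return the per-candidate loss of a model trained
    on the candidates chosen so far (hypothetical interface).
    """
    selected: List[int] = []
    for _ in range(stages):
        current = loss_fn(selected)
        scores = learnability_scores(current, reference_losses)
        selected.extend(top_k(scores, per_stage, set(selected)))
    return selected


# Toy pool: candidates 0-2 carry the same signal, 3-4 share another, and
# candidate 5 is hard even for the reference model.
groups = [0, 0, 0, 1, 1, 2]
reference = [0.1, 0.1, 0.1, 0.1, 0.1, 0.9]


def toy_loss(selected: List[int]) -> List[float]:
    """Stand-in for training: a group is 'learned' once any member is picked."""
    covered = {groups[i] for i in selected}
    return [0.1 if g in covered else 1.0 for g in groups]


picked = build_incrementally(reference, toy_loss, stages=3, per_stage=1)
print(picked)  # [0, 3, 5]
```

Because scores are recomputed after every stage, candidates 1 and 2 drop to a score of zero as soon as candidate 0 is chosen, so later stages move on to signals the set does not yet cover. A one-shot ranking over the same pool would instead fill the budget with near-duplicates from the first group.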
