DD-Ranking: Rethinking the Evaluation of Dataset Distillation
Zekai Li, Xinhao Zhong, Samir Khaki, Zhiyuan Liang, Yuhao Zhou, Mingjia Shi, Ziqiao Wang, Xuanlei Zhao, Wangbo Zhao, Ziheng Qin, Mengxuan Wu, Pengfei Zhou, Haonan Wang, David Junhao Zhang, Jia-Wei Liu, Shaobo Wang, Dai Liu, Linfeng Zhang, Guang Li, Kun Wang, Zheng Zhu

TL;DR
This paper introduces DD-Ranking, a new evaluation framework for dataset distillation that addresses the limitations of accuracy as a metric, with the aim of enabling fairer and more meaningful comparisons between methods.
Contribution
The paper proposes DD-Ranking, a unified evaluation framework with new metrics to better assess the true quality and performance of distilled datasets.
Findings
Existing evaluation metrics can be misleading: they reward additional techniques, such as soft labels and stronger data augmentation, rather than the quality of the distilled data itself.
Randomly sampled images can outperform some advanced methods under current metrics.
DD-Ranking offers a more accurate assessment of dataset distillation methods.
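The comparison against randomly sampled images is only meaningful when both runs share the same evaluation recipe. The following is a minimal sketch of that idea, not the paper's LRS or ARS definition, which this summary does not spell out. The names `EvalSetting` and `gain_over_random` are hypothetical, and the accuracy values in the example are placeholders, not reported results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSetting:
    """Evaluation recipe shared by both runs (hypothetical fields)."""
    label_type: str    # e.g. "hard" or "soft"
    augmentation: str  # e.g. "none" or "strong"


def gain_over_random(distilled_acc: float,
                     random_acc: float,
                     distilled_setting: EvalSetting,
                     random_setting: EvalSetting) -> float:
    """Return the accuracy gain (percentage points) of a distilled dataset
    over a random subset of the same size.

    Raises ValueError when the two runs used different label or
    augmentation settings, because the difference would then mix data
    quality with the effect of the extra techniques.
    """
    if distilled_setting != random_setting:
        raise ValueError("runs must share label type and augmentation")
    return distilled_acc - random_acc


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    setting = EvalSetting(label_type="soft", augmentation="strong")
    gain = gain_over_random(41.0, 44.0, setting, setting)
    # A negative gain means the random subset did better under this recipe.
    print(f"gain over random: {gain:+.1f} points")
```

A negative value under a matched recipe corresponds to the finding above: the random subset, given the same soft labels and augmentation, outperforms the distilled data.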
Abstract
In recent years, dataset distillation has provided a reliable solution for data compression, where models trained on the resulting smaller synthetic datasets achieve performance comparable to those trained on the original datasets. To further improve the performance of synthetic datasets, various training pipelines and optimization objectives have been proposed, greatly advancing the field of dataset distillation. Recent decoupled dataset distillation methods introduce soft labels and stronger data augmentation during the post-evaluation phase and scale dataset distillation up to larger datasets (e.g., ImageNet-1K). However, this raises a question: Is accuracy still a reliable metric to fairly evaluate dataset distillation methods? Our empirical findings suggest that the performance improvements of these methods often stem from additional techniques rather than the inherent quality of…
Peer Reviews
Decision: Submitted to ICLR 2026
* The paper is well-written with a clear, thorough, and concise introduction that effectively summarizes key points
* The authors demonstrated through extensive experiments that the proposed evaluation metrics are meaningful and effective.

* Discussion of limitations is lacking
* Theoretical background would be needed

1. The paper clearly demonstrates that performance improvements in existing dataset distillation methods often result from knowledge distillation or data augmentation rather than from the informativeness of the synthetic images.
2. The proposed evaluation metrics for comparing the performance of different models are clearly defined.
3. The authors conduct experiments with LRS and ARS across different model architectures, teacher models, and hyperparameter settings to verify the robustness of the

1. The paper spends excessive space analyzing the limitations of existing methods. This part is repetitive and should be condensed into a shorter empirical motivation section.
2. Although LRS and ARS are intuitively motivated, their theoretical foundation is weak and lacks conceptual depth.
3. In Section 3, the definition of DD RANKING is unclear. It is not specified whether it refers to LRS, ARS, or a combination of both.
4. In line 240, the description of the normalization method for computing

* The paper provides a meaningful attempt to standardize the evaluation of dataset distillation methods, enabling a more controlled comparison against random selection under matched label and augmentation setups.
* The results highlight interesting findings: under hard-label usage, matching-based DD methods remain stronger than recent soft-label–based approaches, suggesting that much of the improvement in newer methods (e.g., SRe2L) may stem from knowledge distillation rather than from the intr

* Limited applicability of the metric: Although the proposed metrics allow comparisons under matched label/augmentation setups, they do not measure the ultimate achievable performance of each DD method under its best hyperparameter and setup choices. Since DD performance also depends on factors like architecture, optimizer, and training configuration, comparing distilled datasets only under uniform conditions offers limited insight into each method’s full potential.
* Ambiguous interpretability
Taxonomy
Topics: Data Mining Algorithms and Applications · Big Data and Business Intelligence · Machine Learning and Data Classification
