DD-Ranking: Rethinking the Evaluation of Dataset Distillation

Zekai Li; Xinhao Zhong; Samir Khaki; Zhiyuan Liang; Yuhao Zhou; Mingjia Shi; Ziqiao Wang; Xuanlei Zhao; Wangbo Zhao; Ziheng Qin; Mengxuan Wu; Pengfei Zhou; Haonan Wang; David Junhao Zhang; Jia-Wei Liu; Shaobo Wang; Dai Liu; Linfeng Zhang; Guang Li; Kun Wang; Zheng Zhu; Zhiheng Ma; Joey Tianyi Zhou; Jiancheng Lv; Yaochu Jin; Peihao Wang; Kaipeng Zhang; Lingjuan Lyu; Yiran Huang; Zeynep Akata; Zhiwei Deng; Xindi Wu; George Cazenavette; Yuzhang Shang; Justin Cui; Jindong Gu; Qian Zheng; Hao Ye; Shuo Wang; Xiaobo Wang; Yan Yan; Angela Yao; Mike Zheng Shou; Tianlong Chen; Hakan Bilen; Baharan Mirzasoleiman; Manolis Kellis; Konstantinos N. Plataniotis; Zhangyang Wang; Bo Zhao; Yang You; Kai Wang

arXiv:2505.13300·cs.CV·September 23, 2025

TL;DR

This paper introduces DD-Ranking, an evaluation framework for dataset distillation that addresses the limitations of accuracy as a metric and aims for fairer, more meaningful comparisons between methods.

Contribution

The paper proposes DD-Ranking, a unified evaluation framework with new metrics to better assess the true quality and performance of distilled datasets.

Findings

01. Existing evaluation metrics can be misleading: they can reward auxiliary techniques, such as soft labels and stronger data augmentation, rather than the quality of the distilled data itself.

02. Randomly sampled images can outperform some advanced methods under current metrics.

03. DD-Ranking offers a more accurate assessment of dataset distillation methods.
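The comparison against random sampling in the second finding can be sketched in a few lines. This is a minimal illustration, not the paper's definition of LRS or ARS: the `EvalSetting` fields, the function name `improvement_over_random`, and the accuracy values are all hypothetical, chosen only to show the idea of scoring a distilled set against a random subset of equal size under the same label and augmentation recipe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSetting:
    """Post-evaluation recipe shared by both runs (hypothetical fields)."""
    label_type: str      # e.g. "hard" or "soft"
    augmentation: str    # e.g. "none" or "strong"


def improvement_over_random(acc_distilled: float, acc_random: float) -> float:
    """Accuracy gap, in percentage points, between a distilled set and a
    random subset of equal size, both evaluated under the same setting."""
    for acc in (acc_distilled, acc_random):
        if not 0.0 <= acc <= 100.0:
            raise ValueError("accuracies must be percentages in [0, 100]")
    return acc_distilled - acc_random


# Placeholder numbers for illustration only; these are NOT results from the paper.
setting = EvalSetting(label_type="soft", augmentation="strong")
gap = improvement_over_random(acc_distilled=50.0, acc_random=52.0)
print(setting, gap)
```

A negative gap means the random subset did better under the matched recipe, which is the situation the second finding describes.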

Abstract

In recent years, dataset distillation has provided a reliable solution for data compression, where models trained on the resulting smaller synthetic datasets achieve performance comparable to those trained on the original datasets. To further improve the performance of synthetic datasets, various training pipelines and optimization objectives have been proposed, greatly advancing the field of dataset distillation. Recent decoupled dataset distillation methods introduce soft labels and stronger data augmentation during the post-evaluation phase and scale dataset distillation up to larger datasets (e.g., ImageNet-1K). However, this raises a question: Is accuracy still a reliable metric to fairly evaluate dataset distillation methods? Our empirical findings suggest that the performance improvements of these methods often stem from additional techniques rather than the inherent quality of…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* The paper is well-written with a clear, thorough, and concise introduction that effectively summarizes key points.
* The authors demonstrated through extensive experiments that the proposed evaluation metrics are meaningful and effective.

Weaknesses

* Discussion of limitations is lacking
* Theoretical background would be needed

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The paper clearly demonstrates that performance improvements in existing dataset distillation methods often result from knowledge distillation or data augmentation rather than from the informativeness of the synthetic images.
2. The proposed evaluation metrics for comparing the performance of different models are clearly defined.
3. The authors conduct experiments with LRS and ARS across different model architectures, teacher models, and hyperparameter settings to verify the robustness of the

Weaknesses

1. The paper spends excessive space analyzing the limitations of existing methods. This part is repetitive and should be condensed into a shorter empirical motivation section.
2. Although LRS and ARS are intuitively motivated, their theoretical foundation is weak and lacks conceptual depth.
3. In Section 3, the definition of DD-Ranking is unclear. It is not specified whether it refers to LRS, ARS, or a combination of both.
4. In line 240, the description of the normalization method for computing

Reviewer 03 · Rating 4 · Confidence 4

Strengths

* The paper provides a meaningful attempt to standardize the evaluation of dataset distillation methods, enabling a more controlled comparison against random selection under matched label and augmentation setups.
* The results highlight interesting findings: under hard-label usage, matching-based DD methods remain stronger than recent soft-label–based approaches, suggesting that much of the improvement in newer methods (e.g., SRe2L) may stem from knowledge distillation rather than from the intr

Weaknesses

* Limited applicability of the metric: Although the proposed metrics allow comparisons under matched label/augmentation setups, they do not measure the ultimate achievable performance of each DD method under its best hyperparameter and setup choices. Since DD performance also depends on factors like architecture, optimizer, and training configuration, comparing distilled datasets only under uniform conditions offers limited insight into each method’s full potential.
* Ambiguous interpretability

Code & Models

Repositories

nus-hpc-ai-lab/dd-ranking
PyTorch · Official

Taxonomy

Topics: Data Mining Algorithms and Applications · Big Data and Business Intelligence · Machine Learning and Data Classification