Can pre-trained models assist in dataset distillation?
Yao Lu, Xuguang Chen, Yuchen Zhang, Jianyang Gu, Tianle Zhang, Yifan Zhang, Xiaoniu Yang, Qi Xuan, Kai Wang, Yang You

TL;DR
This paper investigates how pre-trained models can enhance dataset distillation, showing that model diversity and domain matching improve synthetic dataset quality and cross-architecture generalization.
Contribution
It systematically studies the role of PTMs in dataset distillation, revealing that diverse and even sub-optimal models can aid DD, with domain match being important.
Findings
Increasing model diversity improves DD performance.
Sub-optimal models can outperform well-trained ones in DD.
Domain match is crucial, but domain-specific PTMs are not mandatory.
Abstract
Dataset Distillation (DD) is a prominent technique that encapsulates knowledge from a large-scale original dataset into a small synthetic dataset for efficient training. Meanwhile, Pre-trained Models (PTMs) function as knowledge repositories, containing extensive information from the original dataset. This naturally raises a question: Can PTMs effectively transfer knowledge to synthetic datasets, guiding DD accurately? To this end, we conduct preliminary experiments, confirming the contribution of PTMs to DD. Afterwards, we systematically study different options in PTMs, including initialization parameters, model architecture, training epoch and domain knowledge, revealing that: 1) Increasing model diversity enhances the performance of synthetic datasets; 2) Sub-optimal models can also assist in DD and outperform well-trained ones in certain cases; 3) Domain-specific PTMs are not…
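The idea in the abstract, that frozen pre-trained models can supply a training signal for the synthetic data, can be illustrated with a toy sketch. This is not the authors' formulation: the excerpt here does not define their loss terms, so the linear "models", the function names (`ptm_guidance_loss`, `guided_step`), and the step size below are all hypothetical. The sketch only shows one plausible shape of such a signal: the classification loss of a diverse pool of frozen models, averaged over the pool, with its gradient taken with respect to a synthetic sample.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear_logits(weights, x):
    # Hypothetical stand-in for a frozen pre-trained model:
    # one weight row per class, no bias.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def ptm_guidance_loss(model_pool, x, label):
    # Mean cross-entropy of the frozen models on one synthetic sample.
    losses = [-math.log(softmax(linear_logits(w, x))[label]) for w in model_pool]
    return sum(losses) / len(losses)


def guidance_grad(model_pool, x, label):
    # Gradient of ptm_guidance_loss with respect to the synthetic sample x.
    # For a linear model: dCE/dx_j = sum_c (p_c - y_c) * W[c][j].
    grad = [0.0] * len(x)
    for weights in model_pool:
        probs = softmax(linear_logits(weights, x))
        err = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
        for j in range(len(x)):
            grad[j] += sum(err[c] * weights[c][j] for c in range(len(weights)))
    return [g / len(model_pool) for g in grad]


def guided_step(model_pool, x, label, lr=0.5):
    # One gradient-descent update of the synthetic sample (models stay frozen).
    grad = guidance_grad(model_pool, x, label)
    return [v - lr * g for v, g in zip(x, grad)]


# Assumed toy pool: two different frozen "models" (a stand-in for model diversity).
model_pool = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, -0.5], [-0.5, 0.5]],
]
x0 = [0.0, 0.0]
x1 = guided_step(model_pool, x0, label=0)
loss_before = ptm_guidance_loss(model_pool, x0, 0)
loss_after = ptm_guidance_loss(model_pool, x1, 0)
```

In a real distillation pipeline a term like this would presumably be added, with some weight, to the base distillation objective rather than used on its own; how the paper combines the terms is not stated in this excerpt.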
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The paper systematically studies different options in pre-trained models (PTMs), such as initialization parameters, model architecture, training epoch, and domain knowledge. The influence of hyperparameters is also considered. This comprehensive analysis helps in understanding the influence of each factor on synthetic datasets individually. 2. The paper presents the findings and analysis in a clear and organized manner, making it easy for readers to follow and understand the experiments
The paper primarily focuses on empirical studies and experimental results. It would be valuable to provide more theoretical analysis and insights into why the proposed approach works, such as discussing the underlying principles or mathematical foundations.
This paper investigates the application of pre-trained models (PTMs) in dataset distillation (DD) and their potential to enhance the performance of synthetic datasets. The authors introduce two novel loss terms, CLoM and CCLoM, to leverage P
This paper is well motivated and studies an important problem.
1. A primary concern is the seemingly ambiguous use of the concept of Pre-trained Models (PTMs). Very few details are given about the data used to pretrain the PTMs. The introduction and related work sections give the impression that pretraining follows the convention of using the ImageNet dataset, as in the methods cited. However, I did not realize until Section 5.2, which discusses pretraining across various epochs (1, 4, 6, 10...), that this is proba
1. This paper studies how pre-trained models can help dataset distillation from different perspectives. 2. This paper proposes two additional loss terms (CLoM and CCLoM) to incorporate PTMs in the process.
1. Model diversification and whether the models need to be well trained were already studied in the CVPR 2023 paper "Accelerating Dataset Distillation via Model Augmentation", which makes both of these points. 2. The improvement from CCLoM (Table 2) is very marginal: the average gain is at most 1.35%.
1. The writing is good and the paper is easy to follow. 2. The idea of leveraging PTMs in DD is well motivated and novel. 3. The study of various options is impressive.
1. There is no analysis of the training time or storage cost of leveraging PTMs in DD, especially in the domain-specific case. Pretraining models with different initializations and architectures on the target dataset seems time-consuming (even with early stopping). 2. The analysis of cross-architecture generalization: the numbers reported in Table 4 seem inconsistent with those in Table 3 of the original DM paper.
Taxonomy
Topics: Machine Learning and Data Classification · Topic Modeling · Domain Adaptation and Few-Shot Learning
