Self-Supervised Dataset Distillation for Transfer Learning
Dong Bok Lee, Seanie Lee, Joonho Ko, Kenji Kawaguchi, Juho Lee, Sung Ju Hwang

TL;DR
This paper introduces a novel dataset distillation method tailored for self-supervised learning, producing small synthetic datasets that effectively facilitate transfer learning without requiring labels.
Contribution
The paper proposes a new approach to distill unlabeled datasets into synthetic samples optimized for self-supervised pre-training, addressing bias issues in gradient estimation and reducing computational costs.
Findings
Effective in transfer learning scenarios
Reduces dataset size while maintaining performance
Addresses bias in gradient estimation for SSL
Abstract
Dataset distillation methods have achieved remarkable success in distilling a large dataset into a small set of representative samples. However, they are not designed to produce a distilled dataset that can be effectively used for facilitating self-supervised pre-training. To this end, we propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL). We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to the randomness originating from data augmentations or masking. To address this issue, we propose to minimize the mean squared error (MSE) between a model's representations of the synthetic examples and their corresponding learnable target feature representations for the inner objective, which does not introduce any…
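The inner objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a linear model in place of a neural network, plain gradient descent, and toy data, and every name in it (`inner_loop`, `X_SYN`, `Y_SYN`) is hypothetical. It only shows the shape of the idea: the model is fit by MSE regression from synthetic inputs onto learnable target representations, with no augmentation or masking, so the inner loss is a deterministic function of the synthetic set and the initial weights.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def mse(pred, target):
    """Mean squared error averaged over every entry."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)) / n


def inner_loop(x_syn, y_syn, steps=200, lr=0.5, seed=0):
    """Fit a linear map W so that x_syn @ W matches the target features y_syn.

    Assumption: a linear model stands in for the network, purely for illustration.
    No random augmentation is applied, so the loss depends only on
    (x_syn, y_syn) and the seeded initial weights.
    """
    rng = random.Random(seed)
    d, k = len(x_syn[0]), len(y_syn[0])
    w = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(d)]
    n = len(x_syn) * k
    losses = []
    for _ in range(steps):
        pred = matmul(x_syn, w)
        losses.append(mse(pred, y_syn))
        resid = [[p - t for p, t in zip(pr, tr)] for pr, tr in zip(pred, y_syn)]
        grad = matmul(transpose(x_syn), resid)  # X^T (XW - Y), shape d x k
        # Gradient of the MSE is (2 / n) * X^T (XW - Y).
        w = [[wij - lr * 2.0 * gij / n for wij, gij in zip(wr, gr)]
             for wr, gr in zip(w, grad)]
    return w, losses


# Toy synthetic set: 4 "examples" with 3 input dims and 2-dim target features.
X_SYN = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
Y_SYN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

if __name__ == "__main__":
    _, losses = inner_loop(X_SYN, Y_SYN)
    print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.2e}")
```

In a full distillation pipeline, both the synthetic inputs and the target features would be updated by an outer loop; that part is omitted here because the abstract above is truncated before describing it.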
Peer Reviews
Decision: ICLR 2024 poster
1. Interesting and novel problem setting: self-supervised dataset distillation for transfer learning, which might produce task-agnostic condensed data and boost transferability. 2. Theoretical contribution: a gradient of the SSL objectives with data augmentations or masked inputs is a biased estimator of the true gradient, and a detailed proof is provided. 3. Interesting experiments in which the self-supervised approach outperforms supervised distillation methods.
1. What is the motivation for minimizing the MSE between the original data representation of the model from the inner loop and that of the model pre-trained on the original dataset? 2. Why is the self-supervised learning method better than the supervised method in this problem? I only see the empirical results; could you provide more explanation? Update: I have read the responses and my concerns are partially addressed. The authors did not provide more explanation than empirical results (It might be a mo
1) The problem that this paper focuses on is somewhat new to the dataset distillation community; 2) The presentation and writing of this paper are coherent, and the idea is easy to follow.
1) This paper does NOT choose the state-of-the-art baselines in dataset distillation for comparison, such as IDC, IDM, etc. 2) The authors only provide the experimental results of transfer learning, but did NOT provide the test accuracy of the model trained solely on the distilled dataset. This makes me wonder whether the distilled images retain enough information compared to the images distilled by other baselines, or whether the proposed method only performs well in the scenario of transfer learnin
The paper is overall well written and easy to follow. The method is relatively well-motivated, with the scenario when one wants to train many different architectures to find the best one for mobile/resource constrained deployment. The experiments cover a broad range of source, and transfer data sets, and many different network architectures are considered. The method is novel as far as I can tell (although I’m not an expert on dataset distillation).
Despite the overall positive impression, I see several weaknesses: - In the experiments the authors distill the data sets into 1000-2000 examples, for self-supervised learning, without augmentation. The authors do not comment on augmentations when training on the distilled data. This approach might work for the small models and low resolution used in the experiments, but I’m not convinced that it generalizes to larger models, more complex data sets and higher resolution. Data augmentation is a c
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Cancer-related molecular mechanisms research · Machine Learning and Data Classification
