Multisize Dataset Condensation
Yang He, Lingao Xiao, Joey Tianyi Zhou, Ivor Tsang

TL;DR
This paper introduces Multisize Dataset Condensation (MDC), a method that creates flexible, multi-sized condensed datasets in a single process, addressing subset degradation and resource constraints in on-device machine learning.
Contribution
MDC compresses multiple condensation processes into one, using an adaptive subset loss to produce datasets of various sizes without extra condensation steps.
Findings
Achieves 5.22%-6.40% average accuracy gains when condensing CIFAR-10 to 10 images per class
Validated with ConvNet, ResNet, and DenseNet on SVHN, CIFAR-10, CIFAR-100, and ImageNet
Reduces storage and computation for on-device dataset condensation
Abstract
While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there's a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the "subset degradation problem" in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an "adaptive subset loss" on top of the basic condensation loss to…
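The subset degradation problem and the idea of adding a subset term to the basic loss can be illustrated with a toy sketch. This is not the paper's implementation: the mean-matching `matching_loss` is an assumed stand-in for the basic condensation loss, and `multisize_loss`, `subset_size`, and the one-dimensional data are hypothetical names and values chosen for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(points: Sequence[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def matching_loss(synthetic: Sequence[Vector], real: Sequence[Vector]) -> float:
    """Assumed stand-in for the basic condensation loss:
    squared distance between the synthetic mean and the real mean."""
    ms, mr = mean_vector(synthetic), mean_vector(real)
    return sum((a - b) ** 2 for a, b in zip(ms, mr))


def multisize_loss(
    synthetic: Sequence[Vector], real: Sequence[Vector], subset_size: int
) -> float:
    """Basic loss on the full synthetic set plus the same loss on its
    first `subset_size` samples (a simplified subset term)."""
    base = matching_loss(synthetic, real)
    subset = matching_loss(synthetic[:subset_size], real)
    return base + subset


# Toy data: the full synthetic set matches the real mean exactly,
# but its first sample alone does not.
real = [[0.0], [2.0]]        # real mean is 1.0
synthetic = [[3.0], [-1.0]]  # full synthetic mean is 1.0, first sample is 3.0

print(matching_loss(synthetic, real))      # loss on the full synthetic set
print(matching_loss(synthetic[:1], real))  # loss on the size-1 subset
print(multisize_loss(synthetic, real, 1))  # combined objective
```

In this toy case the full set has zero basic loss while its size-1 subset does not, so only the combined objective gives the optimizer a reason to improve the subset. How the subset is chosen adaptively during condensation is left out of this sketch.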
Peer Reviews
Decision·ICLR 2024 oral
The paper presents a dataset condensation (DC) solution named Multisize Dataset Condensation, which is crucial for on-device scenarios. The proposed method significantly outperforms Baseline-C.
1. The synthetic samples within the subset seem to be fixed, which may not reflect “Multisize Dataset Condensation” correctly.
1. The proposed Multisize Dataset Condensation (MDC) method can effectively condense the N condensation processes into a single condensation process with lower storage and addresses the “subset degradation problem”. 2. The adaptive subset loss in the MDC method helps mitigate the “subset degradation problem” and improves the accuracy of the condensed dataset compared to the Baseline-C. 3. The concept of the rate of change of feature distance as a substitute for the computationally expensive “gra
1. When the IPC (images per class) is small, a large accuracy gap remains between the proposed method and Baseline-A, as shown in Figure 2 and Table 1. 2. The impact of the calculation interval (∆t) on the performance of the MDC method needs further analysis to determine the optimal interval size.
Originality: This paper offers a unique approach to dataset condensation, aiming to cater to the specific needs of on-device scenarios. The proposal to compress N condensation processes into one is innovative. Quality: The "adaptive subset loss" is a novel concept, targeting the "subset degradation problem." The method to select the Most Learnable Subset (MLS) is well-thought-out and complex. Clarity: The paper is organized logically, and concepts are explained clearly. The use of terms like "ad
The paper explains three baselines for comparison. The accuracy is not higher than that of Baseline A. Please explain the reason. Is it possible to reach Baseline A's accuracies? Equation 7 is not that clear. How is the distance between the full dataset and a subset calculated?
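The reviews mention a feature distance, its rate of change over a calculation interval ∆t, and the selection of a Most Learnable Subset (MLS). One plausible reading is sketched below. It is not a reconstruction of the paper's Equation 7: the squared distance between mean features, the nested-prefix subsets, and the names `feature_distance` and `most_learnable_subset` are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_feature(features: Sequence[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def feature_distance(subset: Sequence[Vector], real: Sequence[Vector]) -> float:
    """Assumed distance: squared Euclidean distance between the mean
    feature of a synthetic subset and the mean feature of the real data."""
    ms, mr = mean_feature(subset), mean_feature(real)
    return sum((a - b) ** 2 for a, b in zip(ms, mr))


def most_learnable_subset(
    dist_before: Dict[int, float], dist_after: Dict[int, float], delta_t: int
) -> int:
    """Return the subset size whose distance dropped fastest over delta_t steps."""
    rates = {k: (dist_before[k] - dist_after[k]) / delta_t for k in dist_before}
    return max(rates, key=lambda k: rates[k])


# Toy features: three synthetic samples, observed before and after
# delta_t hypothetical condensation steps. Subsets are nested prefixes.
real_features = [[1.0, 1.0]]
before = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
after = [[0.5, 0.5], [0.0, 0.0], [0.0, 0.0]]
sizes = [1, 2]
delta_t = 10

dist_before = {k: feature_distance(before[:k], real_features) for k in sizes}
dist_after = {k: feature_distance(after[:k], real_features) for k in sizes}
mls = most_learnable_subset(dist_before, dist_after, delta_t)
print(mls)
```

Here only the first sample moved toward the real mean, so the size-1 prefix shows the largest rate of decrease and is the one selected.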
Taxonomy
Topics: Advanced Clustering Algorithms Research · Advanced Data Processing Techniques · Neural Networks and Applications
Methods: Concatenated Skip Connection · Softmax · Dense Connections · Max Pooling · Kaiming Initialization · Batch Normalization · Global Average Pooling · Average Pooling · Dense Block
