Multisize Dataset Condensation

Yang He; Lingao Xiao; Joey Tianyi Zhou; Ivor Tsang

arXiv:2403.06075 · cs.CV · April 16, 2024 · 2 cites

TL;DR

This paper introduces Multisize Dataset Condensation (MDC), a method that creates flexible, multi-sized condensed datasets in a single process, addressing subset degradation and resource constraints in on-device machine learning.

Contribution

MDC compresses multiple condensation processes into one, using an adaptive subset loss to produce datasets of various sizes without extra condensation steps.
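The idea of adding a subset term to a base condensation loss can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the squared distance between mean feature vectors is only a stand-in for the basic condensation loss, and the subset is fixed to the first `subset_size` synthetic samples, whereas the paper selects the subset adaptively.

```python
import numpy as np


def mean_feature_distance(real_feats, syn_feats):
    """Squared L2 distance between mean feature vectors (stand-in matching loss)."""
    diff = real_feats.mean(axis=0) - syn_feats.mean(axis=0)
    return float(diff @ diff)


def multisize_loss(real_feats, syn_feats, subset_size):
    """Base loss on the full synthetic set plus the same loss on a leading subset.

    Assumption: the subset is simply the first `subset_size` rows; the paper's
    adaptive selection of the subset is not modeled here.
    """
    base = mean_feature_distance(real_feats, syn_feats)
    subset = mean_feature_distance(real_feats, syn_feats[:subset_size])
    return base + subset


# Toy features: 500 "real" samples and 10 synthetic samples, 16 dimensions each.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 16))
syn = rng.normal(size=(10, 16))

full_only = mean_feature_distance(real, syn)
combined = multisize_loss(real, syn, subset_size=3)
print(f"full-set loss: {full_only:.4f}, full + subset loss: {combined:.4f}")
```

Because the subset term penalizes the first few synthetic samples for being unrepresentative on their own, minimizing the combined loss would push a small prefix of the condensed set to remain useful without a separate condensation run, which is the behavior the contribution describes.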

Findings

01

Achieved 5.22%-6.40% accuracy gains on CIFAR-10 with 10 images per class

02

Validated with ConvNet, ResNet, and DenseNet architectures on the SVHN, CIFAR-10/100, and ImageNet datasets

03

Reduces storage and computation for on-device dataset condensation

Abstract

While dataset condensation effectively enhances training efficiency, its application in on-device scenarios brings unique challenges. 1) Due to the fluctuating computational resources of these devices, there's a demand for a flexible dataset size that diverges from a predefined size. 2) The limited computational power on devices often prevents additional condensation operations. These two challenges connect to the "subset degradation problem" in traditional dataset condensation: a subset from a larger condensed dataset is often unrepresentative compared to directly condensing the whole dataset to that smaller size. In this paper, we propose Multisize Dataset Condensation (MDC) by compressing N condensation processes into a single condensation process to obtain datasets with multiple sizes. Specifically, we introduce an "adaptive subset loss" on top of the basic condensation loss to…

Peer Reviews

Decision · ICLR 2024 oral

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The paper presents a dataset condensation (DC) solution named Multisize Dataset Condensation, which is crucial for on-device scenarios. The proposed method significantly outperforms Baseline-C.

Weaknesses

1. The synthetic samples within the subset seem to be fixed, which may not reflect “Multisize Dataset Condensation” correctly.

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

1. The proposed Multisize Dataset Condensation (MDC) method can effectively condense the N condensation processes into a single condensation process with lower storage and addresses the “subset degradation problem”. 2. The adaptive subset loss in the MDC method helps mitigate the “subset degradation problem” and improves the accuracy of the condensed dataset compared to the Baseline-C. 3. The concept of the rate of change of feature distance as a substitute for the computationally expensive “gra

Weaknesses

1. When the IPC (images per class) is small, a large accuracy gap remains between the proposed method and Baseline-A, as shown in Figure 2 and Table 1. 2. The impact of the calculation interval (∆t) on the performance of the MDC method needs further analysis to determine the optimal interval size.

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 4

Strengths

Originality: This paper offers a unique approach to dataset condensation, aiming to cater to the specific needs of on-device scenarios. The proposal to compress N condensation processes into one is innovative. Quality: The "adaptive subset loss" is a novel concept, targeting the "subset degradation problem." The method to select the Most Learnable Subset (MLS) is well-thought-out and complex. Clarity: The paper is organized logically, and concepts are explained clearly. The use of terms like "ad

Weaknesses

The paper describes three baselines for comparison. The accuracy is not higher than that of Baseline-A; please explain why. Is it possible to reach Baseline-A's accuracies? Equation 7 is unclear: how is the distance between the full dataset and the subset calculated?

Code & Models

Repositories

he-y/multisize-dataset-condensation · PyTorch · Official

Videos

Multisize Dataset Condensation · SlidesLive

Taxonomy

Topics: Advanced Clustering Algorithms Research · Advanced Data Processing Techniques · Neural Networks and Applications

Methods: Concatenated Skip Connection · Softmax · Dense Connections · Max Pooling · Kaiming Initialization · Batch Normalization · Global Average Pooling · Average Pooling · Dense Block