Mitigating Bias in Dataset Distillation

Justin Cui; Ruochen Wang; Yuanhao Xiong; Cho-Jui Hsieh

arXiv:2406.06609 · cs.LG · July 11, 2024

TL;DR

This paper investigates how biases in original datasets affect dataset distillation, revealing bias amplification issues and proposing a reweighting method to mitigate bias, significantly improving model performance on biased datasets.

Contribution

It introduces a simple reweighting scheme based on kernel density estimation to reduce bias amplification during dataset distillation.
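The idea of density-based reweighting can be illustrated with a minimal sketch. It assumes that each sample is weighted inversely to a Gaussian kernel density estimate computed in some feature space, so that rare (bias-conflicting) samples count more than common (bias-aligned) ones. The function names, the `bandwidth` parameter, the inverse-density rule, and the toy data are all assumptions made for illustration; the paper's exact formulation may differ.

```python
import numpy as np


def gaussian_kde_density(features, bandwidth=1.0):
    """Unnormalized Gaussian KDE score for each row of `features` (n, d).

    The normalizing constant of the kernel is omitted because it cancels
    once the weights are normalized to sum to one.
    """
    sq_norms = np.sum(features ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped to avoid tiny negatives.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    sq_dists = np.maximum(sq_dists, 0.0)
    kernel = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return kernel.mean(axis=1)


def inverse_density_weights(features, bandwidth=1.0, eps=1e-8):
    """Hypothetical weighting rule: weight proportional to 1 / density."""
    density = gaussian_kde_density(features, bandwidth)
    weights = 1.0 / (density + eps)
    return weights / weights.sum()


def weighted_mean_embedding(features, weights):
    """Weighted feature mean, e.g. as a target for a matching-style objective."""
    return weights @ features


# Toy data (assumed): 95 bias-aligned points near 0, 5 bias-conflicting near 5.
rng = np.random.default_rng(0)
majority = rng.normal(0.0, 0.1, size=(95, 2))
minority = rng.normal(5.0, 0.1, size=(5, 2))
features = np.vstack([majority, minority])

weights = inverse_density_weights(features, bandwidth=0.5)
print("mean weight, majority:", weights[:95].mean())
print("mean weight, minority:", weights[95:].mean())
print("unweighted mean:", features.mean(axis=0))
print("reweighted mean:", weighted_mean_embedding(features, weights))
```

In this toy setting the minority cluster receives larger per-sample weights, which pulls the reweighted mean away from the majority cluster. How such weights would enter an actual distillation objective is not specified here and is left as an assumption.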

Findings

01. Biases like color and background are amplified through distillation.

02. The proposed reweighting method effectively reduces bias amplification.

03. Significant performance improvements on biased datasets, e.g., 67.7% accuracy boost on CMNIST.

Abstract

Dataset Distillation has emerged as a technique for compressing large datasets into smaller synthetic counterparts, facilitating downstream training tasks. In this paper, we study the impact of bias inside the original dataset on the performance of dataset distillation. With a comprehensive empirical evaluation on canonical datasets with color, corruption and background biases, we found that color and background biases in the original dataset will be amplified through the distillation process, resulting in a notable decline in the performance of models trained on the distilled dataset, while corruption bias is suppressed through the distillation process. To reduce bias amplification in dataset distillation, we introduce a simple yet highly effective approach based on a sample reweighting scheme utilizing kernel density estimation. Empirical results on multiple real-world and synthetic…

Peer Reviews

Decision: ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

The storyline is relatively clear and easy for readers to follow. The experimental results are quite good, especially in the large setting (IPC 50). The problem addressed is interesting, and the method for solving it is quite simple.

Weaknesses

The types and importance of bias: In this paper, the authors investigate three types of bias (color, background, and corrupted data). I wonder whether other types of bias exist. In real datasets, bias may be mixed and complex. Does your method still work in such settings?

The time complexity of your method: Though the method is quite simple, an additional operation like KDE comes at a cost. How much time complexity does your method introduce?

The baselines are limited. Please refer to https:

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

1) The problem is well-motivated. The studied topic is relevant to the community.

2) They are the first to consider the weaknesses of existing dataset distillation algorithms with complex/anomalous training datasets.

Weaknesses

1) A sample reweighting scheme utilizing kernel density estimation is not novel to me. Overcoming the bias problem with a reweighting strategy is a well-known approach in traditional ML paradigms.

2) The kernel density estimation reweighting-based approach is not scalable. This work only shows experiments on three toy datasets. Performance on popular benchmark datasets, i.e., CIFAR100 and Tiny ImageNet, is not reported in this work.

3) It would be great if the authors could show t

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

S1: To the best of my knowledge, this is the first work investigating the impact of bias within the original dataset on dataset distillation. Additionally, the empirical observations suggest that two specific biases, colour and background, are amplified during the distillation process, which may provide some new insight for the community.

S2: The paper is well-organized and presents a clear narrative.

Weaknesses

W1: It appears that the authors are trying to argue that their sample reweighting strategy can be used as a plug-and-play scheme for matching-based dataset distillation methods. However, they only test this strategy in combination with two earlier methods (distribution matching and gradient matching), which provides insufficient evidence of its generalization ability and thus limits its reliability.

W2: The method proposed in this paper seems to excel only on very simple d

Reviewer 04 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- The paper proposed a novel analysis of the influence of bias in the context of distillation.
- The proposed approach is intuitive and enhances the robustness significantly.

Weaknesses

- The experimental benchmarks are limited to the synthetic datasets. In recent research, several real-world datasets have been employed for evaluation, such as CelebA-HQ, IMDB, and BFFHQ.
- The logic from Section 3 to Section 4 seems somewhat disjointed to me. Specifically, it's not evident how the concepts of 'bias amplification' and 'bias suppression' are factored into the design of the reweighting scheme.
- While the paper did assess performance against various distillation methods, I believ

Code & Models

Repositories

Guang000/Awesome-Dataset-Distillation
noneOfficial

Taxonomy

Topics: Machine Learning and Data Classification