Feedback-guided Data Synthesis for Imbalanced Classification

Reyhane Askari Hemmat; Mohammad Pezeshki; Florian Bordes; Michal Drozdzal; Adriana Romero-Soriano

arXiv:2310.00158·cs.CV·September 11, 2024·2 cites

TL;DR

This paper introduces a feedback-guided data augmentation framework that uses classifier feedback to generate synthetic samples, significantly improving classification performance on imbalanced datasets.

Contribution

The work presents a novel framework that incorporates classifier feedback into generative models to produce more useful synthetic data for imbalanced classification tasks.

Findings

01. Achieved over 4% improvement on underrepresented classes in ImageNet-LT.
02. Achieved a boost of over 5% in worst-group accuracy on NICO++.
03. The framework is twice as efficient in synthetic sample generation.
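
Worst-group accuracy, the metric behind finding 02, is the lowest accuracy among the groups in the evaluation set. The following is a minimal sketch of that computation, not the paper's evaluation code. The function name `worst_group_accuracy` and the toy labels are illustrative assumptions.

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst accuracy, per-group accuracy dict).

    Hypothetical helper for illustration only.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    per_group = {}
    for g in np.unique(groups):
        mask = groups == g
        # Accuracy restricted to the samples that belong to group g.
        per_group[g.item()] = float((y_true[mask] == y_pred[mask]).mean())
    return min(per_group.values()), per_group

# Toy data (made up): group 0 is small and harder than group 1.
y_true = [0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1]
groups = [0, 0, 1, 1, 1, 1]

worst, per_group = worst_group_accuracy(y_true, y_pred, groups)
print(worst, per_group)
```

On this toy data the overall accuracy is 5/6, while the worst group sits at 0.5, which is why the metric is used to expose failures on small groups that an average would hide.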

Abstract

Current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples to improve the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. In order for the framework to be effective, we find that the samples must be close to the support of the real data of the task at…
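
To make the idea of classifier feedback concrete, here is a minimal sketch of two feedback scores, prediction entropy and low confidence on the intended class, computed from classifier logits. It rests on loud assumptions: the abstract says the feedback drives the sampling of the generative model itself, whereas this sketch only ranks candidates that have already been generated, which is a simpler stand-in. All function names, the logits, and the labels are hypothetical.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_score(probs, eps=1e-12):
    # Higher entropy means the classifier is less certain about the sample.
    return -(probs * np.log(probs + eps)).sum(axis=1)

def low_confidence_score(probs, labels):
    # One minus the probability assigned to the intended class.
    return 1.0 - probs[np.arange(len(labels)), labels]

def select_by_feedback(scores, k):
    # Indices of the k candidates with the highest feedback score.
    return np.argsort(-scores)[:k]

# Made-up logits for three candidate samples, all intended as class 0.
logits = np.array([[4.0, 0.0, 0.0],   # classifier is confident
                   [0.0, 0.0, 0.0],   # classifier is maximally uncertain
                   [1.0, 0.5, 0.0]])  # classifier is somewhat uncertain
labels = np.array([0, 0, 0])

probs = softmax(logits)
ent = entropy_score(probs)
conf = low_confidence_score(probs, labels)
chosen = select_by_feedback(ent, k=2)
print(chosen)
```

Ranking after generation discards samples the classifier already handles well. Guiding the sampler with the same signal, as the abstract describes, would avoid generating those samples in the first place.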

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

● the experimental results were stunning, achieving state-of-the-art on ImageNet-LT
● the writing is clear and easy to follow
● the experiment is comprehensive, comparing three types of feedback criteria

Weaknesses

ImageNet-LT is essentially a pseudo long-tail dataset, where the tail classes may not necessarily be the minority in the actual data distribution. Therefore, generative models can sample relatively well. However, for real-world long-tail distributions, is it also difficult for generative models to obtain sufficiently good samples?

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. The problem definition to encourage the generated samples to be helpful to the classifier, inspired by active learning frameworks, is novel.
2. The proposed method performs better than the previous sample synthesis-based imbalance classification methods.

Weaknesses

- The proposed solution for the problem definition is too naïve. For active learning methods, in addition to the confidence-based or entropy-based approach, a margin-based approach is also possible, as are recent active learning criteria such as BALD [1], VAAL [2], or MCDAL [3]. To claim the contribution of a complete research paper, the authors should devise an idea to leverage such recent active learning methods to find more novel solutions suitable for this problem. [1] Deep Bayesian A

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 2

Strengths

- Originality. The paper designs a diffusion model sampling strategy that uses the feedback of the pre-trained classifier to generate samples that help improve its own performance, which improves the classification performance to a certain extent. It has a certain degree of innovation.
- Quality. The experimental design of the paper is reasonable, and the feasibility of the method is verified on ImageNet-LT and NICO++.
- Clarity. The paper is well-organized and clearly written.
- Significance. The i

Weaknesses

1. The font format of the article is not uniform. Are the words in italics meant to express any special meaning? They make the text difficult to read.
2. The charts are mixed up. For example, is Figure 5 a table or a graph? The sizes of some pictures also don’t match.
3. What is the time complexity of this method?
4. Are there more evaluation metrics for comparing the proposed method against the baseline method?

Code & Models

Repositories

facebookresearch/feedback-guided-data-synthesis
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · COVID-19 diagnosis using AI · Digital Media Forensic Detection