Distribution-Aware Data Expansion with Diffusion Models
Haowei Zhu, Ling Yang, Jun-Hai Yong, Hongzhi Yin, Jiawei Jiang, Meng Xiao, Wentao Zhang, Bin Wang

TL;DR
DistDiff is a distribution-aware diffusion model framework that enhances dataset expansion by generating distribution-consistent samples, improving model accuracy and robustness without requiring additional training.
Contribution
The paper introduces DistDiff, a novel, training-free data expansion method that pairs a diffusion model with class-specific hierarchical prototypes to maintain distribution fidelity, and it outperforms existing synthesis techniques.
Findings
DistDiff improves accuracy across multiple datasets.
It outperforms existing synthesis-based data augmentation methods.
The expanded datasets enhance robustness across different model architectures.
Abstract
The scale and quality of a dataset significantly impact the performance of deep models. However, acquiring large-scale annotated datasets is both a costly and time-consuming endeavor. To address this challenge, dataset expansion technologies aim to automatically augment datasets, unlocking the full potential of deep models. Current data expansion techniques include image transformation and image synthesis methods. Transformation-based methods introduce only local variations, leading to limited diversity. In contrast, synthesis-based methods generate entirely new content, greatly enhancing informativeness. However, existing synthesis methods carry the risk of distribution deviations, potentially degrading model performance with out-of-distribution samples. In this paper, we propose DistDiff, a training-free data expansion framework based on the distribution-aware diffusion model.…
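The risk of distribution deviation described in the abstract can be made concrete with a small sketch. The code below is not the paper's algorithm; it is an illustrative assumption of how class-level and group-level prototypes (the reviews mention class-specific hierarchical prototypes) might be built from feature vectors and used to score how far a synthesized sample drifts from the real data. The function names `class_prototype`, `group_prototypes`, and `deviation_score`, the use of plain k-means, and the Euclidean distance are all hypothetical choices made for illustration.

```python
import numpy as np


def class_prototype(features):
    """Class-level prototype: the mean feature vector of one class (assumption)."""
    return features.mean(axis=0)


def group_prototypes(features, k, n_iter=10, seed=0):
    """Group-level prototypes: k cluster centers from a plain k-means loop (assumption)."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct real samples (fancy indexing returns a copy).
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every sample to every center, shape (n_samples, k).
        dists = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def deviation_score(candidate, class_proto, group_protos):
    """Hypothetical deviation score: distance to the class prototype plus
    distance to the nearest group prototype. Lower means closer to the real data."""
    d_class = np.linalg.norm(candidate - class_proto)
    d_group = np.linalg.norm(group_protos - candidate, axis=1).min()
    return d_class + d_group


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(200, 8))   # stand-in for real image features of one class
    c = class_prototype(real)
    g = group_prototypes(real, k=3)
    print(deviation_score(np.zeros(8), c, g))        # sample near the real data
    print(deviation_score(np.full(8, 10.0), c, g))   # sample far from the real data
```

In a sketch like this, a synthesized sample with a high score would be a candidate out-of-distribution sample; how DistDiff actually uses its prototypes during generation is described in the paper itself, not here.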
Peer Reviews
Decision: NeurIPS 2024 poster
The paper addresses a very important problem: how to perform effective data augmentation using synthetic data generation models. The approach does not require any training or fine-tuning of the diffusion models to adapt the generation to the required data distribution. The paper has detailed ablation studies for each design choice. Also, the experimental results suggest considerable improvement over the prior data expansion approaches.
1. There seems to be some confusion in the explanation. Section 3.3 (Transform Data Points) seems to suggest that the approach always starts with a sample from the dataset. However, the algorithm in the appendix does not include any sample from the dataset.
2. Is the approach extendable to other supervised learning tasks like segmentation, detection, etc.? It looks to me like the augmentation approach is tailored for classification tasks only, with the usage of class-specific hierarchical protot
Taxonomy
Topics: Advanced Data Storage Technologies · Advanced Database Systems and Queries · Data Management and Algorithms
Methods: Diffusion
