Synthetic Data Generation for Long-Tail Medical Image Classification: A Case Study in Skin Lesions
Jiaxiang Jiang, Mahesh Subedar, Omesh Tickoo

TL;DR
This paper presents a diffusion-model-based synthetic data augmentation method to improve long-tail skin lesion classification, significantly boosting performance on rare classes in imbalanced medical datasets.
Contribution
The authors introduce a novel diffusion inpainting model with OOD post-selection for diverse, realistic synthetic medical images, enhancing long-tail classification performance.
Findings
Achieved over 28% improvement on the rarest class in the skin lesion dataset.
Significant overall performance gains demonstrated on imbalanced medical imaging data.
Diffusion-based augmentation outperforms existing methods in class imbalance scenarios.
Abstract
Long-tailed class distributions are pervasive in multi-class medical datasets and pose significant challenges for deep learning models, which typically underperform on tail classes with limited samples. This limitation is particularly problematic in medical applications, where rare classes often correspond to severe or high-risk diseases and therefore require high diagnostic accuracy. Existing solutions, including specialized architectures, rebalanced loss functions, and handcrafted data augmentation, offer only marginal improvements and struggle to scale due to their limited and largely deterministic variability. To address these challenges, we introduce a diffusion-model-driven synthetic data augmentation pipeline tailored for medical long-tailed classification. Our approach features a novel inpainting diffusion model combined with an Out-of-Distribution (OOD) post-selection mechanism to…
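The abstract is cut off before it describes how the OOD post-selection works, so the following is only an illustrative sketch of one common way such a filter can be built, not the authors' method. It assumes each image has already been mapped to a feature vector, fits a diagonal Gaussian to the real tail-class features, and keeps only synthetic samples whose normalized distance falls within a quantile of the real samples' own distances. The function names, the diagonal-Gaussian scoring, and the 0.95 quantile are all hypothetical choices made for this example.

```python
import random
import statistics


def fit_diagonal_gaussian(real_feats):
    """Per-dimension mean and standard deviation of the real feature vectors."""
    columns = list(zip(*real_feats))
    means = [statistics.fmean(c) for c in columns]
    # Small constant avoids division by zero for constant dimensions.
    stds = [statistics.pstdev(c) + 1e-8 for c in columns]
    return means, stds


def distance_sq(feat, means, stds):
    """Squared normalized distance of one feature vector from the real mean."""
    return sum(((x - m) / s) ** 2 for x, m, s in zip(feat, means, stds))


def select_in_distribution(real_feats, synth_feats, quantile=0.95):
    """Return one boolean per synthetic sample: True if it is kept.

    The threshold is the chosen quantile of the real samples' own distances
    (hypothetical rule, assumed for illustration only).
    """
    means, stds = fit_diagonal_gaussian(real_feats)
    real_d = sorted(distance_sq(f, means, stds) for f in real_feats)
    threshold = real_d[int(quantile * (len(real_d) - 1))]
    return [distance_sq(f, means, stds) <= threshold for f in synth_feats]


def sample(center, n, dim=8):
    """Toy stand-in for image features: Gaussian points around `center`."""
    return [[random.gauss(center, 1.0) for _ in range(dim)] for _ in range(n)]


random.seed(0)
real = sample(0.0, 200)                      # features of real tail-class images
synth = sample(0.0, 50) + sample(6.0, 50)    # 50 plausible, 50 far-off samples
keep = select_in_distribution(real, synth)
print(sum(keep[:50]), "of 50 plausible samples kept")
print(sum(keep[50:]), "of 50 far-off samples kept")
```

In practice the feature vectors would come from a pretrained image encoder rather than a random generator, and the quantile trades off how many synthetic images survive against how closely they resemble the real tail-class data.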
