Image Classification Using a Diffusion Model as a Pre-Training Model
Kosuke Ukita, Ye Xiaolong, Tsuyoshi Okita

TL;DR
This paper introduces a Transformer-based diffusion model conditioned on representations from a Vision Transformer (ViT). Pre-trained with self-supervision on unlabeled data, it is evaluated on zero-shot hematoma detection in brain imaging, where it improves on DINOv2 by +6.15% in accuracy and +13.60% in F1-score.
Contribution
The paper presents a novel diffusion model that incorporates representation-conditioning from ViT, improving zero-shot classification without large labeled datasets.
Findings
Achieved +6.15% accuracy over DINOv2 in hematoma detection.
Demonstrated effective self-supervised learning for medical image classification.
Validated the approach on one zero-shot classification task, where it outperformed the contrastive learning baseline DINOv2.
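For readers unfamiliar with the metrics behind these findings, the following is a minimal sketch of how accuracy and F1-score are computed for a binary task such as hematoma vs. no hematoma. The function name `accuracy_and_f1` and the toy labels are illustrative assumptions, not data or code from the paper, and the summary does not say whether the reported gains are percentage points or relative improvements.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Toy labels (hypothetical): 1 = hematoma present, 0 = absent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

F1 ignores true negatives, so on data with few positive cases it can move much more than accuracy does. That may be one reason the two reported gains differ in size.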
Abstract
In this paper, we propose a diffusion model that integrates a representation-conditioning mechanism, where the representations derived from a Vision Transformer (ViT) are used to condition the internal process of a Transformer-based diffusion model. This approach enables representation-conditioned data generation, addressing the challenge of requiring large-scale labeled datasets by leveraging self-supervised learning on unlabeled data. We evaluate our method through a zero-shot classification task for hematoma detection in brain imaging. Compared to the strong contrastive learning baseline, DINOv2, our method achieves a notable improvement of +6.15% in accuracy and +13.60% in F1-score, demonstrating its effectiveness in image classification.
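The abstract does not specify how the ViT representation enters the diffusion Transformer. One common way to condition a Transformer block on a vector is adaptive layer normalization, in which the conditioning vector produces a per-channel scale and shift. The sketch below illustrates that idea only, under the assumption that such a modulation is used. The function names, dimensions, and random weights are hypothetical and are not taken from the paper.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, vec):
    """Apply y = W @ vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def condition_token(token, rep, w_scale, b_scale, w_shift, b_shift):
    """Modulate a normalized token with scale/shift predicted from `rep`.

    `rep` stands in for a ViT image representation (assumption).
    """
    scale = linear(w_scale, b_scale, rep)
    shift = linear(w_shift, b_shift, rep)
    normed = layer_norm(token)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


def random_matrix(rows, cols, rng, std=0.02):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
token_dim, rep_dim = 8, 4          # toy sizes, chosen for illustration
token = [rng.gauss(0.0, 1.0) for _ in range(token_dim)]
rep = [rng.gauss(0.0, 1.0) for _ in range(rep_dim)]
w_scale = random_matrix(token_dim, rep_dim, rng)
w_shift = random_matrix(token_dim, rep_dim, rng)
zeros = [0.0] * token_dim

conditioned = condition_token(token, rep, w_scale, zeros, w_shift, zeros)
print(len(conditioned))
```

With all-zero projection weights the function reduces to plain layer normalization, so in this formulation the representation can only change the block's output through the learned scale and shift.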
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification · Generative Adversarial Networks and Image Synthesis
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer · Dropout · Layer Normalization · Contrastive Learning · Diffusion · Position-Wise Feed-Forward Layer
