Image Classification Using a Diffusion Model as a Pre-Training Model

Kosuke Ukita; Ye Xiaolong; Tsuyoshi Okita

arXiv:2505.06890 · cs.LG · May 13, 2025

TL;DR

This paper introduces a Transformer-based diffusion model conditioned on representations from a Vision Transformer (ViT) and evaluates it on zero-shot hematoma detection in brain imaging, where it improves on DINOv2 by +6.15% in accuracy and +13.60% in F1-score.

Contribution

The paper presents a diffusion model whose internal process is conditioned on ViT representations. Because it relies on self-supervised learning from unlabeled data, it supports zero-shot classification without large labeled datasets.

Findings

01. Achieved +6.15% accuracy and +13.60% F1-score over DINOv2 in zero-shot hematoma detection.

02. Demonstrated that self-supervised learning on unlabeled data is effective for medical image classification.

03. Outperformed a strong baseline (DINOv2) on the zero-shot classification task used for evaluation.
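
The findings are reported as accuracy and F1-score. As a reminder of how the two metrics differ, here is a minimal sketch for a binary task. The labels are made-up toy data, not results from the paper, and the function names are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred, positive=1):
    """Fraction of samples whose predicted label matches the true label."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels (1 = hematoma present, 0 = absent); purely illustrative.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
    print(accuracy(y_true, y_pred))  # 0.625
    print(f1_score(y_true, y_pred))  # 0.4
```

F1 ignores true negatives, so it is more sensitive than accuracy to missed positives. That is one reason the two metrics can improve by different amounts on the same task.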

Abstract

In this paper, we propose a diffusion model that integrates a representation-conditioning mechanism, where the representations derived from a Vision Transformer (ViT) are used to condition the internal process of a Transformer-based diffusion model. This approach enables representation-conditioned data generation, addressing the challenge of requiring large-scale labeled datasets by leveraging self-supervised learning on unlabeled data. We evaluate our method through a zero-shot classification task for hematoma detection in brain imaging. Compared to the strong contrastive learning baseline, DINOv2, our method achieves a notable improvement of +6.15% in accuracy and +13.60% in F1-score, demonstrating its effectiveness in image classification.
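
The abstract does not say how the ViT representation enters the diffusion Transformer. With that caveat, one widely used option in Transformer-based diffusion models is adaptive layer-norm modulation: each token is normalized, then scaled and shifted by linear functions of the conditioning vector. The sketch below illustrates that idea in plain Python. All names (`condition_token`, `w_scale`, `w_shift`) and the toy dimensions are hypothetical and are not taken from the paper.

```python
import math


def linear(x, weight, bias):
    """Compute y = W x + b for one vector; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def condition_token(token, rep, w_scale, b_scale, w_shift, b_shift):
    """Modulate one diffusion-Transformer token with a representation.

    Assumption: `rep` stands in for a ViT image representation. The scale
    and shift are linear functions of `rep`, applied after layer norm.
    """
    scale = linear(rep, w_scale, b_scale)
    shift = linear(rep, w_shift, b_shift)
    normed = layer_norm(token)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


if __name__ == "__main__":
    token = [1.0, 2.0, 3.0, 4.0]   # toy token of width 4
    rep = [0.5, -0.5]              # toy conditioning vector of width 2
    w_scale = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.0, 0.0]]
    w_shift = [[0.0, 0.2], [0.2, 0.0], [0.0, 0.0], [0.2, 0.2]]
    zeros = [0.0] * 4
    print(condition_token(token, rep, w_scale, zeros, w_shift, zeros))
```

With all modulation weights and biases set to zero, `condition_token` reduces to plain layer normalization, so the conditioning path starts as an identity and only changes the token when the projections are non-zero.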

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Vision Transformer · Dropout · Layer Normalization · Contrastive Learning · Diffusion · Position-Wise Feed-Forward Layer