Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation

Changlu Guo, Anders Nymark Christensen, and Morten Rieger Hannemose

arXiv:2506.20449·cs.CV·June 26, 2025

Open Access

TL;DR

Med-Art is a diffusion transformer framework for medical text-to-image generation with limited data. It uses vision-language models to generate image descriptions where medical text is scarce, and a Hybrid-Level Diffusion Fine-tuning (HLDF) method that enables pixel-level losses to improve image quality.

Contribution

The paper presents Med-Art, a novel framework that adapts large-scale pre-trained models with hybrid-level diffusion fine-tuning for medical image synthesis with limited data.

Findings

01 Achieves state-of-the-art FID and KID scores on medical datasets.

02 Effectively handles small datasets and textual data scarcity.

03 Improves image quality and downstream classification performance.
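Finding 01 reports KID, which is usually computed as an unbiased squared-MMD estimate with a cubic polynomial kernel over image features. The following is a minimal sketch of that estimator, not the paper's evaluation code. The two-dimensional feature vectors are toy values made up for illustration; in practice the features would come from a pretrained Inception network.

```python
def poly_kernel(x, y):
    """Cubic polynomial kernel k(x, y) = (x.y / d + 1)^3, the usual KID kernel."""
    d = len(x)
    dot = sum(a * b for a, b in zip(x, y))
    return (dot / d + 1.0) ** 3


def kid(real, fake):
    """Unbiased squared-MMD estimate between two lists of feature vectors."""
    m, n = len(real), len(fake)
    # Within-set sums skip the diagonal (i == j) so the estimate stays unbiased.
    k_rr = sum(poly_kernel(real[i], real[j])
               for i in range(m) for j in range(m) if i != j)
    k_ff = sum(poly_kernel(fake[i], fake[j])
               for i in range(n) for j in range(n) if i != j)
    # Cross-set sum uses every (real, fake) pair.
    k_rf = sum(poly_kernel(r, f) for r in real for f in fake)
    return k_rr / (m * (m - 1)) + k_ff / (n * (n - 1)) - 2.0 * k_rf / (m * n)


# Toy features (assumed values, for illustration only).
real = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
close = [[0.1, 0.9], [0.9, 0.1], [0.5, 0.5]]
far = [[3.0, 3.0], [4.0, 2.0], [2.0, 4.0]]

print("KID(real, close):", kid(real, close))
print("KID(real, far):  ", kid(real, far))
```

Lower values indicate that the generated features are closer to the real ones. Because the estimator is unbiased, it can come out slightly negative on small samples.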

Abstract

Text-to-image generative models have achieved remarkable breakthroughs in recent years. However, their application in medical image generation still faces significant challenges, including small dataset sizes and a scarcity of medical textual data. To address these challenges, we propose Med-Art, a framework specifically designed for medical image generation with limited data. Med-Art leverages vision-language models to generate visual descriptions of medical images, which overcomes the scarcity of applicable medical textual data. Med-Art adapts a large-scale pre-trained text-to-image model, PixArt-α, based on the Diffusion Transformer (DiT), achieving high performance under limited data. Furthermore, we propose an innovative Hybrid-Level Diffusion Fine-tuning (HLDF) method, which enables pixel-level losses, effectively addressing issues such as overly saturated colors. We achieve…
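The abstract says HLDF enables pixel-level losses but does not spell out how they are combined with the usual latent-space objective. One plausible reading, sketched below purely as an assumption, adds a pixel-space term computed on the decoded one-step estimate of the clean latent. The `decode` function, the `pixel_weight` parameter, and all numeric values are hypothetical stand-ins and are not taken from the paper.

```python
import math


def decode(latent):
    """Hypothetical stand-in for a frozen VAE decoder: each latent value maps to two pixels."""
    pixels = []
    for z in latent:
        pixels.extend([0.5 * z, 0.5 * z])
    return pixels


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def hybrid_loss(z0, image, eps, eps_pred, alpha_bar, pixel_weight=0.1):
    """Latent noise-prediction loss plus a pixel-level loss on the decoded estimate.

    This combination is an assumption for illustration, not the authors' HLDF definition.
    """
    sa = math.sqrt(alpha_bar)
    sn = math.sqrt(1.0 - alpha_bar)
    # Forward diffusion: noisy latent at the sampled timestep.
    z_t = [sa * z + sn * e for z, e in zip(z0, eps)]
    # One-step estimate of the clean latent from the predicted noise.
    z0_hat = [(zt - sn * ep) / sa for zt, ep in zip(z_t, eps_pred)]
    latent_term = mse(eps_pred, eps)          # standard latent-space objective
    pixel_term = mse(decode(z0_hat), image)   # pixel-space comparison
    return latent_term + pixel_weight * pixel_term


# Toy inputs (assumed values).
z0 = [0.2, -0.4, 1.0]
image = decode(z0)
eps = [0.3, -0.1, 0.7]

print("perfect noise prediction:", hybrid_loss(z0, image, eps, eps, alpha_bar=0.5))
print("zero noise prediction:   ", hybrid_loss(z0, image, eps, [0.0, 0.0, 0.0], alpha_bar=0.5))
```

In this sketch the pixel term gives the fine-tuning signal direct access to image-space errors, such as color shifts, that a latent-only loss measures only indirectly.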

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Computer Graphics and Visualization Techniques

Methods: Dropout · Dense Connections · Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Diffusion · Softmax · Transformer