VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers
Marvin Seyfarth, Salman Ul Hassan Dar, Yannik Frisch, Philipp Wild, Norbert Frey, Florian André, and Sandy Engelhardt

TL;DR
VolDiT is a fully transformer-based 3D diffusion model for medical image synthesis that reports improved global coherence, fidelity, and controllability over U-Net-based methods.
Contribution
VolDiT is presented as the first purely transformer-based 3D diffusion model for volumetric medical image synthesis with structured control capabilities.
Findings
Outperforms state-of-the-art U-Net-based models in global coherence and fidelity.
Enables precise spatial control via token-level conditioning.
Demonstrates superior controllability and scalability in 3D medical image generation.
Abstract
Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This…
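The volumetric patch embedding mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the latent shape, patch size, hidden width, and the function names `patchify_3d` and `embed_tokens` are all assumptions chosen for illustration. It only shows how a 3D volume can be cut into non-overlapping cubic patches and linearly projected into a token sequence that global self-attention could then operate on.

```python
import numpy as np

def patchify_3d(volume, patch):
    """Split a (C, D, H, W) volume into non-overlapping cubic patches.

    Returns an array of shape (num_tokens, C * patch**3), one row per patch.
    """
    c, d, h, w = volume.shape
    if d % patch or h % patch or w % patch:
        raise ValueError("spatial dims must be divisible by the patch size")
    gd, gh, gw = d // patch, h // patch, w // patch
    # Separate each spatial axis into (grid index, offset inside the patch).
    x = volume.reshape(c, gd, patch, gh, patch, gw, patch)
    # Put the three grid axes first, then channels and intra-patch offsets.
    x = x.transpose(1, 3, 5, 0, 2, 4, 6)
    return x.reshape(gd * gh * gw, c * patch ** 3)

def embed_tokens(tokens, weight, bias):
    """Linear projection of flattened patches to the transformer width."""
    return tokens @ weight + bias

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 16, 32, 32))  # hypothetical latent volume (C, D, H, W)
patch = 4                                      # hypothetical patch size
hidden = 64                                    # hypothetical transformer width

tokens = patchify_3d(latent, patch)            # one row per 4x4x4 patch
weight = rng.standard_normal((tokens.shape[1], hidden)) * 0.02
bias = np.zeros(hidden)
embedded = embed_tokens(tokens, weight, bias)  # sequence of 3D tokens

print(tokens.shape, embedded.shape)
```

With these assumed sizes the volume yields 4 x 8 x 8 = 256 tokens. The token count grows cubically as the patch size shrinks, which is the main cost trade-off of attending globally over 3D tokens.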
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · 3D Shape Modeling and Analysis
