Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi

TL;DR
This paper introduces a novel 3D text-to-CT generation model that combines a latent diffusion approach with contrastive vision-language pretraining, enabling realistic and clinically relevant volumetric CT synthesis from textual descriptions.
Contribution
It presents a new architecture that integrates a 3D contrastive pretraining scheme with a latent diffusion model for high-quality, controllable text-to-CT generation.
Findings
Achieves competitive performance on the CT-RATE dataset.
Outperforms prior methods in fidelity and semantic alignment.
Synthesized CT volumes improve downstream diagnostic tasks.
Abstract
Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric CT remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging. Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE,…
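The dual-encoder CLIP-style pretraining mentioned in the abstract can be illustrated with a symmetric contrastive loss over paired embeddings. The following is a minimal sketch, not the authors' implementation: the function names, the temperature of 0.07, and the use of random vectors in place of real CT and report encoder outputs are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_loss(ct_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired (CT, report) embeddings.

    Row i of ct_emb is assumed to be paired with row i of text_emb.
    The temperature value is an assumption, not taken from the paper.
    """
    ct = l2_normalize(ct_emb)
    tx = l2_normalize(text_emb)
    logits = ct @ tx.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])          # matching pairs sit on the diagonal
    loss_ct_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_ct = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ct_to_text + loss_text_to_ct)


# Hypothetical stand-ins for encoder outputs: 4 pairs of 8-dimensional embeddings.
rng = np.random.default_rng(0)
ct_emb = rng.normal(size=(4, 8))
aligned = clip_style_loss(ct_emb, ct_emb)          # every pair matches perfectly
shuffled = clip_style_loss(ct_emb, ct_emb[::-1])   # every pair is mismatched
```

In this toy setup the loss is lower when each CT embedding is paired with its own text embedding than when the pairing is reversed, which is the property a shared embedding space relies on when it is later used as the conditioning input for generation.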
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
