Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

Daniele Molino; Camillo Maria Caruso; Filippo Ruffini; Paolo Soda; Valerio Guarrasi

arXiv:2506.00633 · cs.CV · October 2, 2025

Open Access · 1 Model · 1 Dataset

TL;DR

This paper introduces a novel 3D text-to-CT generation model that combines a latent diffusion approach with contrastive vision-language pretraining, enabling realistic and clinically relevant volumetric CT synthesis from textual descriptions.

Contribution

It presents a new architecture that integrates a 3D contrastive pretraining scheme with a latent diffusion model for high-quality, controllable text-to-CT generation.

Findings

1. Achieves competitive performance on the CT-RATE dataset.
2. Outperforms prior methods in fidelity and semantic alignment.
3. Enhances downstream diagnostic tasks with synthesized CT data.

Abstract

Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric CT remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging.

Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE,…
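To make the "dual-encoder CLIP-style" objective mentioned in the Methods more concrete, here is a minimal sketch of a symmetric contrastive loss over paired CT and report embeddings. It is not the authors' implementation. The function names, the temperature of 0.07, and the use of random NumPy arrays in place of real encoder outputs are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_loss(ct_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    ct_emb and text_emb are (batch, dim) arrays where row i of each comes
    from the same CT/report pair. The temperature is an assumed
    hyperparameter, not a value taken from the paper.
    """
    ct = l2_normalize(ct_emb)
    tx = l2_normalize(text_emb)
    logits = ct @ tx.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])  # matching pairs sit on the diagonal
    loss_ct_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_ct = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_ct_to_text + loss_text_to_ct)


# Hypothetical stand-ins for encoder outputs: 4 pairs, 16-dimensional embeddings.
rng = np.random.default_rng(0)
ct_emb = rng.normal(size=(4, 16))

aligned = clip_style_loss(ct_emb, ct_emb)         # every pair matches
shuffled = clip_style_loss(ct_emb, ct_emb[::-1])  # every pair is mismatched
print(f"aligned={aligned:.4f} shuffled={shuffled:.4f}")
```

Minimizing a loss of this kind pulls matching CT and report embeddings together and pushes mismatched ones apart, which is what yields a shared embedding space that a generator can be conditioned on.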

Code & Models

Models

dmolino/text2ct-weights
model · 16 downloads · 1 like

Datasets

dmolino/CT-RATE_Generated_Scans
dataset · 912 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications