Sketch2CT: Multimodal Diffusion for Structure-Aware 3D Medical Volume Generation

Delin An; Chaoli Wang

arXiv:2603.22509 · cs.CV · March 25, 2026

TL;DR

Sketch2CT introduces a multimodal diffusion framework that generates anatomically consistent 3D medical volumes guided by user sketches and text, improving data augmentation and realism in medical imaging.

Contribution

Sketch2CT presents a structure-aware method for 3D medical volume generation that combines sketch and text guidance with a capsule-attention backbone.

Findings

01  Outperforms existing methods on public CT datasets

02  Enables controllable and efficient 3D volume synthesis

03  Produces anatomically accurate and realistic medical images

Abstract

Diffusion probabilistic models have demonstrated significant potential in generating high-quality, realistic medical images, providing a promising solution to the persistent challenge of data scarcity in the medical field. Nevertheless, producing 3D medical volumes with anatomically consistent structures under multimodal conditions remains a complex and unresolved problem. We introduce Sketch2CT, a multimodal diffusion framework for structure-aware 3D medical volume generation, jointly guided by a user-provided 2D sketch and a textual description that captures 3D geometric semantics. The framework initially generates 3D segmentation masks of the target organ from random noise, conditioned on both modalities. To effectively align and fuse these inputs, we propose two key modules that refine sketch features with localized textual cues and integrate global sketch-text representations.…
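
The first stage described in the abstract, generating a 3D segmentation mask from random noise under sketch and text conditioning, can be illustrated with a generic DDPM-style reverse sampling loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: `fuse_conditions` uses plain concatenation in place of the paper's fusion modules, `toy_denoiser` is a hypothetical stand-in for the learned capsule-attention network, and the linear noise schedule, step count, and final thresholding are illustrative choices that do not come from the source.

```python
import numpy as np


def make_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed, commonly used default)."""
    betas = np.linspace(beta_start, beta_end, steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def fuse_conditions(sketch_feat, text_feat):
    """Hypothetical fusion: concatenate sketch and text feature vectors.

    The paper proposes dedicated alignment and fusion modules; this
    concatenation is only a placeholder for them.
    """
    return np.concatenate([sketch_feat, text_feat])


def toy_denoiser(x, t, cond):
    """Hypothetical stand-in for a trained noise-prediction network.

    A real model would be a learned function of (x, t, cond); this one
    just returns a simple deterministic expression so the loop runs.
    """
    return 0.1 * x + 0.01 * cond.mean()


def sample_mask(shape, cond, steps=50, seed=0):
    """Run DDPM-style ancestral sampling, then threshold to a binary mask."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(steps)
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = toy_denoiser(x, t, cond)  # predicted noise at step t
        # Posterior mean of the reverse step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    # Illustrative binarization of the continuous output into a mask.
    return (x > 0).astype(np.uint8)


if __name__ == "__main__":
    cond = fuse_conditions(np.ones(4), np.zeros(4))
    mask = sample_mask((8, 8, 8), cond)
    print(mask.shape, mask.dtype)
```

With an untrained placeholder denoiser the resulting mask is meaningless; the example only shows how a conditioning vector enters each reverse step and how a volume-shaped sample is produced from noise.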

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Multimodal Machine Learning Applications