Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement
Xin Huang, Junjie Liang, Qingshan Hou, Peng Cao, Jinzhu Yang, Xiaoli Liu, Osmar R. Zaiane

TL;DR
This paper introduces a framework for medical image generation that uses visual priors and semantic disentanglement to improve the controllability and quality of text-to-image synthesis. It targets two problems: the modality gap between images and clinical text, and the entanglement of anatomical structure with imaging style in text embeddings.
Contribution
We propose a cross-modal latent alignment and hybrid feature fusion approach that enhances fine-grained control and generation quality in medical image synthesis.
Findings
Outperforms existing methods in generation quality
Improves downstream classification performance
Demonstrates effectiveness across three datasets
Abstract
Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists: coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, which weakens controllability during generation. To address these issues, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. Subsequently, a Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control.…
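The abstract does not describe the internals of the HFFM, so the following is only an illustrative sketch of what "injecting disentangled features through separated channels" could look like. It assumes two independent cross-attention branches (one over anatomy tokens, one over style tokens) whose outputs are added to the image tokens with separate gates. The function names, the gates, and the absence of learned projections are all assumptions made for illustration, not the authors' design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head dot-product cross-attention.

    Hypothetical simplification: no learned Q/K/V projections, so the
    context tokens serve as both keys and values.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return attended


def fuse_separated(image_tokens, anatomy_tokens, style_tokens,
                   anatomy_gate=1.0, style_gate=1.0):
    """Add two independently computed conditioning updates to the image tokens.

    Assumption: each semantic factor has its own attention branch and its
    own scalar gate, so one factor can be changed without touching the other.
    """
    anatomy_update = cross_attend(image_tokens, anatomy_tokens)
    style_update = cross_attend(image_tokens, style_tokens)
    return [
        [x + anatomy_gate * a + style_gate * s for x, a, s in zip(xr, ar, sr)]
        for xr, ar, sr in zip(image_tokens, anatomy_update, style_update)
    ]
```

Because the two branches never share attention weights in this sketch, setting `style_gate=0.0` yields an output that depends only on the anatomy tokens. That independence is the property the separated-channel design is meant to provide; a trained model would replace the raw dot products with learned projections.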
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications
