Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement

Xin Huang; Junjie Liang; Qingshan Hou; Peng Cao; Jinzhu Yang; Xiaoli Liu; Osmar R. Zaiane

arXiv:2603.10519·cs.CV·March 12, 2026

Open Access

TL;DR

This paper introduces a novel framework for medical image generation that uses visual priors and semantic disentanglement to improve controllability and quality in text-to-image synthesis, addressing modality gaps and semantic entanglement issues.

Contribution

We propose a cross-modal latent alignment and hybrid feature fusion approach that enhances fine-grained control and generation quality in medical image synthesis.

Findings

01. Outperforms existing methods in generation quality

02. Improves downstream classification performance

03. Demonstrates effectiveness across three datasets

Abstract

Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists, where coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, thus weakening controllability during generation. To address this, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. Subsequently, a Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control.…
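The abstract names two mechanisms without giving their details: splitting a text embedding into independent semantic representations under visual guidance, and injecting those representations through separated channels. The toy sketch below illustrates only that general idea in plain Python. Everything in it is an assumption made for illustration: the two linear heads, the cosine-based alignment and overlap penalty, the half-and-half channel split, and all names (`disentangle`, `alignment_loss`, `fuse_separated`) are hypothetical and are not the paper's HFFM, DiT, or training objective.

```python
import math
import random


def linear(x, w):
    """Multiply vector x by matrix w (a list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def rand_matrix(rows, cols, rng):
    """Random projection weights standing in for learned parameters."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def disentangle(text_emb, w_struct, w_style):
    """Assumed design: two separate heads map one text embedding to a
    'structure' vector and a 'style' vector."""
    return linear(text_emb, w_struct), linear(text_emb, w_style)


def alignment_loss(z_struct, z_style, v_struct, v_style):
    """Assumed objective: pull each text branch toward its visual prior
    (1 - cosine) and penalize overlap between the two text branches."""
    align = (1.0 - cosine(z_struct, v_struct)) + (1.0 - cosine(z_style, v_style))
    overlap = cosine(z_struct, z_style) ** 2
    return align + overlap


def fuse_separated(token, z_struct, z_style, gate_struct=1.0, gate_style=1.0):
    """Assumed 'separated channels': the first half of the token receives
    only the structure vector, the second half only the style vector."""
    d = len(z_struct)
    assert len(token) == 2 * d and len(z_style) == d
    first = [t + gate_struct * s for t, s in zip(token[:d], z_struct)]
    second = [t + gate_style * s for t, s in zip(token[d:], z_style)]
    return first + second


rng = random.Random(0)
dim_text, dim_branch = 8, 4

text_emb = [rng.gauss(0.0, 1.0) for _ in range(dim_text)]
w_struct = rand_matrix(dim_branch, dim_text, rng)
w_style = rand_matrix(dim_branch, dim_text, rng)

# Stand-ins for visual priors (e.g. features from an image encoder).
v_struct = [rng.gauss(0.0, 1.0) for _ in range(dim_branch)]
v_style = [rng.gauss(0.0, 1.0) for _ in range(dim_branch)]

z_struct, z_style = disentangle(text_emb, w_struct, w_style)
loss = alignment_loss(z_struct, z_style, v_struct, v_style)

token = [0.0] * (2 * dim_branch)
full = fuse_separated(token, z_struct, z_style)
no_style = fuse_separated(token, z_struct, z_style, gate_style=0.0)
```

In this toy setup, switching off the style gate leaves the structure half of the token unchanged, which is the kind of independent control that separated injection is meant to allow. A real system would learn the projection weights by minimizing an alignment objective rather than sampling them at random.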

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Multimodal Machine Learning Applications