Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation

Xusheng Liang; Lihua Zhou; Nianxin Li; Miao Xu; Ziyang Song; Dong Yi; Jinlin Wu; Jiawei Ma; Hongbin Liu; Zhen Lei; Jiebo Luo

arXiv:2508.05008·cs.CV·May 15, 2026

TL;DR

This paper introduces MCDRL, a novel framework combining causal inference and vision-language models to improve medical image segmentation across diverse domains.

Contribution

The paper proposes a causal-driven approach that uses CLIP and text prompts to identify and remove domain-specific confounders, with the aim of improving generalization to unseen domains.

Findings

1. MCDRL outperforms existing methods in segmentation accuracy.

2. The framework demonstrates robust generalization across different medical imaging domains.

3. Extensive experiments validate the effectiveness of causal intervention in medical segmentation.

Abstract

Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains. To address this limitation, we propose Multimodal Causal-Driven Representation Learning (MCDRL), a novel framework that integrates causal inference with the VLM to tackle domain generalization in medical image segmentation. MCDRL is implemented in two steps: first, it leverages CLIP's cross-modal capabilities to identify candidate lesion regions and construct a confounder dictionary…
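
The first step described in the abstract, using cross-modal similarity to pick out candidate lesion regions and collect a confounder dictionary, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-similarity threshold, and the choice to treat non-lesion patch embeddings as dictionary entries are all assumptions made for illustration, and the toy 2-D vectors stand in for real CLIP patch and text embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def candidate_lesion_mask(patch_emb, lesion_text_emb, threshold=0.5):
    """Hypothetical helper: mark patches whose cosine similarity to a
    lesion text-prompt embedding reaches `threshold` (an assumed value).

    patch_emb:       (N, D) array of image patch embeddings
    lesion_text_emb: (D,) array for a prompt describing the lesion
    """
    sims = l2_normalize(patch_emb) @ l2_normalize(lesion_text_emb)
    return sims >= threshold, sims

def build_confounder_dictionary(patch_emb, lesion_mask):
    """Hypothetical helper: keep normalized embeddings of the patches that
    were NOT flagged as lesion candidates. Treating these as confounder
    entries is an assumption, not a detail confirmed by the abstract."""
    return l2_normalize(patch_emb[~lesion_mask])

# Toy stand-ins for CLIP embeddings (real ones would be high-dimensional).
patch_emb = np.array([[1.0, 0.0],
                      [0.9, 0.1],
                      [0.0, 1.0],
                      [0.1, 0.9]])
lesion_text_emb = np.array([1.0, 0.0])

lesion_mask, sims = candidate_lesion_mask(patch_emb, lesion_text_emb)
confounders = build_confounder_dictionary(patch_emb, lesion_mask)

print(lesion_mask)        # which patches look like lesion candidates
print(confounders.shape)  # number of dictionary entries and their dimension
```

In this toy setup the first two patches align with the lesion prompt and the remaining two become dictionary entries. How the dictionary is then used in the second step is not recoverable from the truncated abstract, so the sketch stops here.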
