MedMoE: Modality-Specialized Mixture of Experts for Medical Vision-Language Understanding

Shivang Chopra; Gabriela Sanchez-Rodriguez; Lingchao Mao; Andrew J Feola; Jing Li; Zsolt Kira

arXiv:2506.08356 · cs.CV · June 12, 2025

Open Access

TL;DR

MedMoE introduces a modality-specific mixture of experts framework that dynamically adapts visual representations for medical vision-language tasks, improving alignment and retrieval across diverse imaging modalities.

Contribution

The paper proposes MedMoE, a novel modular framework with a Mixture-of-Experts module conditioned on report type, enabling modality-specific visual feature extraction without additional supervision.

Findings

01. Enhanced alignment and retrieval performance across multiple medical imaging modalities.

02. Effective spatially adaptive attention to clinically relevant regions.

03. Improved generalization in medical vision-language understanding.

Abstract

Different medical imaging modalities capture diagnostic information at varying spatial resolutions, from coarse global patterns to fine-grained localized structures. However, most existing vision-language frameworks in the medical domain apply a uniform strategy for local feature extraction, overlooking the modality-specific demands. In this work, we present MedMoE, a modular and extensible vision-language processing framework that dynamically adapts visual representation based on the diagnostic context. MedMoE incorporates a Mixture-of-Experts (MoE) module conditioned on the report type, which routes multi-scale image features through specialized expert branches trained to capture modality-specific visual semantics. These experts operate over feature pyramids derived from a Swin Transformer backbone, enabling spatially adaptive attention to clinically relevant regions. This framework…
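
The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes soft routing (the abstract does not say whether routing is soft or hard), stands in for the report-type conditioning with a plain embedding vector, and reduces each expert to a set of per-level mixing weights over an already-pooled feature pyramid. All names (`gate`, `expert_output`, `moe_forward`, `gate_weights`) are hypothetical, and the numbers are toy values rather than Swin Transformer features.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate(report_embedding, gate_weights):
    """Map a report-type embedding to one routing weight per expert.

    gate_weights has one row per expert (hypothetical learned parameters).
    """
    logits = [
        sum(w * x for w, x in zip(row, report_embedding))
        for row in gate_weights
    ]
    return softmax(logits)


def expert_output(level_weights, pyramid):
    """One toy expert: a weighted sum over pyramid levels.

    Assumes every level was already pooled to the same feature width.
    """
    dim = len(pyramid[0])
    return [
        sum(lw * level[d] for lw, level in zip(level_weights, pyramid))
        for d in range(dim)
    ]


def moe_forward(report_embedding, pyramid, gate_weights, experts):
    """Mix expert outputs using routing weights derived from the report type."""
    probs = gate(report_embedding, gate_weights)
    outputs = [expert_output(e, pyramid) for e in experts]
    dim = len(pyramid[0])
    mixed = [
        sum(p * out[d] for p, out in zip(probs, outputs))
        for d in range(dim)
    ]
    return mixed, probs


# Toy usage: two pyramid levels (coarse, fine), two experts.
pyramid = [[1.0, 0.0], [0.0, 1.0]]       # pooled features per level
experts = [[1.0, 0.0], [0.0, 1.0]]       # expert 0 reads coarse, expert 1 reads fine
gate_weights = [[5.0, 0.0], [0.0, 5.0]]  # hypothetical gate parameters
report_embedding = [1.0, 0.0]            # stands in for one report type

mixed, probs = moe_forward(report_embedding, pyramid, gate_weights, experts)
print(probs, mixed)
```

With these toy values the gate sends most of the weight to the coarse-level expert, so the mixed feature is dominated by the coarse pyramid level. A hard top-1 router would be an equally plausible reading of the abstract; the soft mixture is used here only because it is the simplest form to write down.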

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Linear Layer · Dense Connections · Stochastic Depth · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Adam · Attention Is All You Need · Softmax · Swin Transformer · Label Smoothing