MultiMedVision: Multi-Modal Medical Vision Framework

Frank Li; Bardia Khosravi; Mohammadreza Chavoshi; Young Seok Jeon; Theo Dapamede; Hari Trivedi; Janice Newsome; Judy Gichoya

arXiv:2605.09151 · cs.CV · May 12, 2026

TL;DR

MultiMedVision is a unified multi-modal medical imaging framework that processes 2D and 3D data in mixed batches with a single Sparse Vision Transformer encoder, achieving competitive results with 5x less data.

Contribution

It presents a shared encoder architecture for joint 2D/3D medical image representation learning that needs no modality-specific adapters and does not treat 3D volumes as 2D slice sequences.
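
One way a single encoder can accept both 2D and 3D inputs is to treat a 2D image as a volume of depth one, so that every token carries a (z, y, x) grid coordinate. The sketch below illustrates that idea only; the function name `patch_coords`, the patch sizes, and the depth-one convention are assumptions made for illustration and are not taken from the paper.

```python
from itertools import product


def patch_coords(shape, patch):
    """Return the (z, y, x) grid coordinate of every non-overlapping patch.

    `shape` is (H, W) for an image or (D, H, W) for a volume.
    `patch` is always a (pz, py, px) triple. Hypothetical helper.
    """
    if len(shape) == 2:
        # Assumption: a 2D image is handled as a volume of depth one.
        shape = (1, *shape)
    if any(s % p for s, p in zip(shape, patch)):
        raise ValueError("each dimension must be divisible by its patch size")
    grid = [s // p for s, p in zip(shape, patch)]
    return list(product(*(range(g) for g in grid)))


# A toy 4x4 "X-ray" and a toy 4x4x4 "CT" go through the same function.
xray_coords = patch_coords((4, 4), patch=(1, 2, 2))
ct_coords = patch_coords((4, 4, 4), patch=(2, 2, 2))
```

Both inputs end up as plain lists of coordinates, which is what lets a downstream encoder ignore where the tokens came from.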

Findings

01. Achieves macro AUROC of 0.82 on MIMIC-CXR and 0.84 on CheXpert for 2D tasks.

02. Attains 0.85 AUROC on CT-RATE for 3D tasks.

03. Learned representations contain both shared and modality-specific features.
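
The abstract reports the 2D scores as macro AUROC, meaning AUROC is computed per label and then averaged with equal weight. A minimal pure-Python sketch of that metric follows; the function names and the toy data are illustrative assumptions, and the paper's actual evaluation code is not shown on this page.

```python
def auroc(labels, scores):
    """AUROC for one binary label: the probability that a random positive
    is scored above a random negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(label_rows, score_rows):
    """Unweighted mean of per-label AUROC. Rows are samples, columns labels."""
    label_cols = list(zip(*label_rows))
    score_cols = list(zip(*score_rows))
    per_label = [auroc(y, s) for y, s in zip(label_cols, score_cols)]
    return sum(per_label) / len(per_label)


# Toy data: 4 samples, 2 findings (made-up numbers, not from the paper).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.6]]
print(macro_auroc(y_true, y_score))  # 0.875
```

Because every label counts equally, rare findings affect the macro score as much as common ones.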

Abstract

Multi-modal medical imaging enables comprehensive diagnostics, yet current foundation models process 2D (e.g. X-ray) and 3D (e.g. CT) data with separate, dimensionality-specific architectures. We present MultiMedVision, a unified framework for joint 2D/3D representation learning built on a Sparse Vision Transformer. Our model uses 3D Rotary Positional Embeddings and variable-length sequence packing to process mixed-modality batches natively within a shared latent space, without modality-specific adapters or treating 3D volumes as 2D slice sequences. Trained with a self-supervised objective on chest X-rays (MIMIC-CXR) and CT scans (CT-RATE), and using a single shared encoder with 5x less data, MultiMedVision achieves competitive performance on both 2D benchmarks (Macro AUROC 0.82 on MIMIC, 0.84 on CheXpert) and 3D tasks (0.85 on CT-RATE). Analysis of the learned representations reveals…
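
The two mechanisms named in the abstract can be illustrated with a small pure-Python sketch. It assumes an axial form of 3D rotary embeddings (the feature vector is split into three chunks, one each for z, y, and x) and a packing scheme that records cumulative sequence lengths; both are common choices and may differ from what the paper actually does. All function names here are hypothetical.

```python
import math
from itertools import accumulate


def rope_3d(vec, coord, base=10000.0):
    """Apply axial rotary embeddings for a (z, y, x) coordinate.

    Assumption: one third of the features is rotated per axis, in
    consecutive pairs, with the usual geometric frequency schedule.
    """
    if len(vec) % 6 != 0:
        raise ValueError("vector length must be divisible by 6")
    chunk = len(vec) // 3
    out = list(vec)
    for axis, pos in enumerate(coord):
        start = axis * chunk
        for i in range(0, chunk, 2):
            theta = pos * base ** (-i / chunk)
            c, s = math.cos(theta), math.sin(theta)
            a, b = vec[start + i], vec[start + i + 1]
            out[start + i] = a * c - b * s
            out[start + i + 1] = a * s + b * c
    return out


def pack(sequences):
    """Concatenate variable-length sequences and record their boundaries."""
    tokens = [tok for seq in sequences for tok in seq]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return tokens, cu_seqlens


def segment_ids(cu_seqlens):
    """Label each packed token with the index of the sequence it came from."""
    ids = []
    for seg, (lo, hi) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        ids.extend([seg] * (hi - lo))
    return ids


# A 4-token "X-ray" and an 8-token "CT" share one packed batch, no padding.
tokens, cu_seqlens = pack([list(range(4)), list(range(8))])
segments = segment_ids(cu_seqlens)
```

With this layout, attention would be restricted to pairs of tokens that share a segment id, and the rotary step makes query-key dot products depend only on the relative (z, y, x) offset between two patches.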
