Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis
Yang Yu, Dunyuan Xu, Yaoqian Li, Xiaomeng Li, Jinpeng Li, and Pheng-Ann Heng

TL;DR
This paper introduces a method to adapt 2D multimodal large language models for 3D medical image analysis, enhancing performance in medical report generation and visual question answering.
Contribution
It proposes transferring a 2D MLLM to 3D medical images and introduces a Text-Guided Hierarchical MoE framework with a two-stage training strategy.
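This summary does not describe how the Text-Guided Hierarchical MoE is built, so the following is only a generic illustration of text-conditioned expert routing, not the authors' architecture. It is a minimal sketch that assumes a single (non-hierarchical) layer of linear experts whose mixing weights are computed from a text embedding alone. All names, dimensions, and the random weights are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def text_guided_moe(image_tokens, text_embedding, gate_weights, expert_weights):
    """Mix expert outputs for each image token using text-derived weights.

    Assumption: the gate looks only at the text embedding, so every image
    token in one sample shares the same expert mixture.
    """
    gate = softmax(matvec(gate_weights, text_embedding))
    mixed_tokens = []
    for token in image_tokens:
        expert_outputs = [matvec(weights, token) for weights in expert_weights]
        mixed = [
            sum(g * out[i] for g, out in zip(gate, expert_outputs))
            for i in range(len(expert_outputs[0]))
        ]
        mixed_tokens.append(mixed)
    return mixed_tokens, gate


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
token_dim, text_dim, num_experts, num_tokens = 4, 3, 2, 5

gate_weights = random_matrix(num_experts, text_dim, rng)
expert_weights = [random_matrix(token_dim, token_dim, rng) for _ in range(num_experts)]
image_tokens = [[rng.uniform(-1.0, 1.0) for _ in range(token_dim)] for _ in range(num_tokens)]

# Toy one-hot "task prompts" standing in for real text embeddings.
report_prompt = [1.0, 0.0, 0.0]
vqa_prompt = [0.0, 1.0, 0.0]

report_features, report_gate = text_guided_moe(
    image_tokens, report_prompt, gate_weights, expert_weights
)
vqa_features, vqa_gate = text_guided_moe(
    image_tokens, vqa_prompt, gate_weights, expert_weights
)
```

In this toy setup the same image tokens yield different features for the two prompts because the gate assigns different expert weights to each, which is the general idea behind extracting task-specific image features.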
Findings
Outperforms existing 3D medical MLLMs on medical report generation (MRG) and medical visual question answering (MVQA) tasks.
Effectively reuses pre-trained 2D MLLM parameters for 3D data.
Demonstrates improved task-specific feature extraction.
Abstract
3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. Therefore, they have great potential to improve the performance of medical report generation (MRG) and medical visual question answering (MVQA), which are two important tasks in clinical scenarios. However, due to the scarcity of 3D medical images, existing 3D medical MLLMs suffer from an insufficiently pre-trained vision encoder and an inability to extract customized image features for different kinds of tasks. In this paper, we propose to first transfer a 2D MLLM, which is well trained on 2D natural images, to support 3D medical volumetric inputs while reusing all of its pre-trained parameters. To enable the vision encoder to extract tailored image…
