Medical Vision Generalist: Unifying Medical Imaging Tasks in Context
Sucheng Ren, Xiaoke Huang, Xianhang Li, Junfei Xiao, Jieru Mei, Zeyu Wang, Alan Yuille, Yuyin Zhou

TL;DR
Medical Vision Generalist (MVG) is a unified model capable of performing diverse medical imaging tasks across multiple modalities using an image-to-image generation framework, demonstrating superior performance and adaptability.
Contribution
This work introduces MVG, the first foundation model for medical imaging that unifies various tasks and modalities within a single flexible image generation framework.
Findings
MVG outperforms existing vision generalists on a comprehensive benchmark.
MVG's performance improves with more diverse training data.
MVG can adapt to unseen datasets with minimal task-specific data.
Abstract
This study presents Medical Vision Generalist (MVG), the first foundation model capable of handling various medical imaging tasks -- such as cross-modal synthesis, image segmentation, denoising, and inpainting -- within a unified image-to-image generation framework. Specifically, MVG employs an in-context generation strategy that standardizes the handling of inputs and outputs as images. By treating these tasks as an image generation process conditioned on prompt image-label pairs and input images, this approach enables a flexible unification of various tasks, even those spanning different modalities and datasets. To capitalize on both local and global context, we design a hybrid method combining masked image modeling with autoregressive training for conditional image generation. This hybrid approach yields the most robust performance across all involved medical imaging tasks. To…
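The in-context formulation described in the abstract can be pictured with a small sketch. It assumes a 2x2 canvas layout (prompt image and prompt label on top, query image and an empty target slot below), a common choice in image-based in-context learners; the source does not confirm that MVG uses this layout. The names `build_canvas` and `hstack` are hypothetical and used only for illustration.

```python
from typing import List, Tuple

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def hstack(left: Image, right: Image) -> Image:
    """Concatenate two images of equal height side by side."""
    if len(left) != len(right):
        raise ValueError("images must have the same height")
    return [l_row + r_row for l_row, r_row in zip(left, right)]


def build_canvas(
    prompt_img: Image, prompt_lbl: Image, query_img: Image, fill: float = 0.0
) -> Tuple[Image, List[List[int]]]:
    """Assemble an in-context canvas and a mask marking the region to generate.

    Assumed layout (not taken from the paper):
        [ prompt image | prompt label ]
        [ query image  | target slot  ]
    """
    h, w = len(query_img), len(query_img[0])
    for img in (prompt_img, prompt_lbl):
        if len(img) != h or any(len(row) != w for row in img):
            raise ValueError("all images must share the same shape")

    # The target slot is filled with a constant; the model would generate it.
    placeholder = [[fill] * w for _ in range(h)]
    canvas = hstack(prompt_img, prompt_lbl) + hstack(query_img, placeholder)

    # Mask is 1 only in the bottom-right quadrant, i.e. the output to predict.
    mask = [[0] * (2 * w) for _ in range(h)]
    mask += [[0] * w + [1] * w for _ in range(h)]
    return canvas, mask


if __name__ == "__main__":
    ones = [[1.0, 1.0], [1.0, 1.0]]
    twos = [[2.0, 2.0], [2.0, 2.0]]
    threes = [[3.0, 3.0], [3.0, 3.0]]
    canvas, mask = build_canvas(ones, twos, threes)
    for row in canvas:
        print(row)
```

Because both the conditioning pair and the output live on one canvas, segmentation, synthesis, denoising, and inpainting differ only in what the prompt label shows. In this sketch, a training loss would be computed only where the mask is 1.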
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
Comprehensive coverage of imaging modalities and tasks: the paper addresses segmentation, cross-modal synthesis, inpainting, and denoising across modalities such as CT, MRI, X-ray, and micro-ultrasound. The performance and generalization ability are remarkable: MVG outperforms SOTA models such as Painter and LVM on most metrics. It also scales with different datasets and adapts to unseen datasets with minimal samples. The training methodology uses a hybrid approach.
As shown by the authors, the hybrid use of autoregressive training boosts performance. However, it may impose higher computational costs during inference. The authors could provide a more detailed analysis of the trade-offs between performance and inference efficiency, especially when using MVG with heavy medical files, such as high-resolution MRI. The ablation study lacks detailed insights into the contribution of individual components. For example, how do masked image modeling and autoregressive
1. The authors insightfully recognize that a pure masking strategy is insufficient for medical image segmentation, leading them to incorporate an autoregressive training pipeline. Experimental results confirm the effectiveness of this approach.
2. Unlike other foundational models focused primarily on segmentation, this paper addresses a broader range of tasks, including cross-modal synthesis, image denoising, and inpainting, opening potential new research directions.
1. While the addition of multiple tasks is beneficial, the paper overlooks essential medical imaging tasks, such as image registration and inverse reconstruction, making MVG appear more like an expanded segmentation model than a comprehensive foundation model. The reviewer suggests that MVG’s learned feature representations could potentially support image registration by integrating a flow estimation head, and inverse reconstruction by using denoising as a regularizer, unfolding the inverse opti
The paper is the first to use in-context learning to unify multiple medical vision tasks, which is original. The proposed output space unification strategy is useful when training with multiple segmentation tasks.
The advantage of unifying multiple medical vision tasks through the MVG model could not be verified based on the evidence provided in this paper. According to Table 2 and Table 5, the performance gain in segmentation tasks could be due to the colorization strategy instead of unifying other vision tasks. Regarding cross-modal synthesis, inpainting, and denoising tasks, the improvement by the MVG model is marginal compared to the previous generalist model and all generalist models perform wors
- The paper is well written and easy to follow.
- The paper proposes a solution to address the generalization of medical imaging analysis, such as the cross-domain problem.
- The paper proposes a unified colorization formulation to unify the different output types of medical imaging analysis tasks.
- The paper treats the different learning tasks as a prompt-based learning task.
- The paper introduces a new benchmark for generalist medical imaging analysis.
- The proposed framework is limited to the 2D scenario.
- The paper does not explain the clinical motivation for why a generalized medical imaging analysis model is needed. To me, since there are many pre-trained specialized models, medical researchers can just pick one of the state-of-the-art models and get better performance than the generalized models.
- The motivation for combining masked image modeling and autoregressive training is not clear. In the experiment, autoregressive training is more super
Taxonomy
Topics: Radiology practices and education
Methods: Sparse Evolutionary Training · Inpainting
