Boosting Medical Visual Understanding From Multi-Granular Language Learning
Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan

TL;DR
This paper introduces MGLL, a novel contrastive learning framework that enhances medical visual understanding by aligning multi-label and multi-granularity textual descriptions with images, outperforming existing methods.
Contribution
The paper proposes MGLL, a multi-granular language learning approach that improves alignment in medical image-text models through structured supervision and cross-granularity consistency.
Findings
MGLL outperforms existing CLIP variants on most downstream tasks across fundus and X-ray datasets.
The framework effectively aligns multi-label and multi-granularity descriptions.
MGLL maintains computational efficiency as a plug-and-play module.
Abstract
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to…
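The abstract names soft-label supervision and point-wise constraints but does not give formulas. As a rough sketch of how such terms are commonly written, and not the authors' implementation, the Python below computes (a) a soft-label contrastive loss, where each image's target distribution is spread evenly over all of its matching texts, and (b) a point-wise binary cross-entropy applied to each image-text pair on its own. The function names, the uniform weighting of positives, the sigmoid over temperature-scaled similarities, and the default `tau=0.07` are all assumptions made for illustration.

```python
import math


def softmax(row, tau):
    """Temperature-scaled softmax over one row of similarity scores."""
    scaled = [s / tau for s in row]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def soft_label_contrastive(sim, targets, tau=0.07):
    """Cross-entropy between softmax(sim / tau) and a soft target distribution.

    sim[i][j]     -- similarity between image i and text j
    targets[i][j] -- 1 if text j describes image i, else 0; a row may hold
                     several 1s (multi-label). Assumption: positives share
                     the target probability mass equally.
    """
    total = 0.0
    for row, tgt in zip(sim, targets):
        n_pos = sum(tgt)
        if n_pos == 0:
            raise ValueError("each image needs at least one matching text")
        probs = softmax(row, tau)
        soft = [t / n_pos for t in tgt]
        total += -sum(q * math.log(p) for q, p in zip(soft, probs) if q > 0)
    return total / len(sim)


def pointwise_bce(sim, targets, tau=0.07):
    """Binary cross-entropy on every image-text pair, treated independently.

    Uses the stable identity BCE(x, t) = softplus(x) - t * x with x = sim / tau.
    """
    total, count = 0.0, 0
    for row, tgt in zip(sim, targets):
        for s, t in zip(row, tgt):
            x = s / tau
            softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
            total += softplus - t * x
            count += 1
    return total / count


# Hypothetical toy batch: 2 images, 3 candidate texts.
sim = [[0.9, 0.8, -0.5], [-0.4, 0.1, 0.7]]
targets = [[1, 1, 0], [0, 0, 1]]  # image 0 has two correct texts
print("soft-label contrastive:", soft_label_contrastive(sim, targets))
print("point-wise BCE:", pointwise_bce(sim, targets))
```

With one-hot targets the first function reduces to the ordinary CLIP-style cross-entropy, so the soft-label form can be read as a generalization to images with several correct descriptions. The point-wise term scores each pair separately, so it does not force the positives of one image to compete with each other.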
Peer Reviews
Decision: ICLR 2026 Poster
The paper addresses an important and clinically relevant topic, multi-granular vision-language representation learning for medical imaging, which has strong potential to improve medical AI systems by aligning visual features with diagnostic text at different levels of detail.
- Although the paper mentions multi-label alignment as part of its motivation, it remains unclear how labels are defined within the framework, and what the exact relationship is between multi-label and multi-granularity. Does a “label” refer to a textual description at a specific granularity level?
- Ambiguity in the definition of text features in the point-wise loss. It is not explicitly stated whether T_j refers to a single label at a given granularity level, or a concatenation of multiple labels
1. The paper has contributed two large-scale image-text pair datasets for fundus and X-ray images, providing multi-granular annotation, which will be very helpful to the field.
2. According to the evaluation and derivation, the proposed losses (soft CLIP loss, point-wise BCE loss, and KL loss) help improve the model's performance on multiple downstream evaluations, showing a uniform improvement against baselines.
3. The ablation experiment is especially detailed, providing strong support to th
My major concern is the clear formatting issue. The paper has clearly adjusted the vertical space between the section and sub-section titles, gaining more space for their content. Table 1 and Figure 4 overlap with each other. The reviewer believes that this is a violation of the conference policy, which suggests "Do not change any aspects of the formatting parameters in the style files. In particular, do not modify the width or length of the rectangle the text should fit into..." Considering tha
1. The overall contribution is clear and highly practical, offering significant utility for addressing the multi-level semantic complexity inherent in medical images.
2. Experiments are comprehensive and results are significant:
   2.1. Covers multiple datasets (Fundus and X-ray) and 11 downstream tasks;
   2.2. Validates various application scenarios, including linear probing, full fine-tuning, and integration with MLLMs;
   2.3. Significantly outperforms existing CLIP variants on the majority of tasks,
1. The details of dataset construction are insufficient.
2. The innovation leans more towards a compositional approach, rather than proposing entirely new learning principles or optimization mechanisms.
3. Lacks ablation studies on the temperature coefficient (τ).
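For context on the last point: in CLIP-style losses, τ divides the similarity scores before the softmax, so it controls how sharply the model separates the top match from the rest. The sketch below is a generic illustration of that effect, not a reproduction of the paper's setup; the similarity values and the three τ settings are hypothetical.

```python
import math


def softmax_with_temperature(scores, tau):
    """Softmax over scores / tau, with the max subtracted for stability."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower means a more peaked distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical cosine similarities between one image and three texts.
scores = [0.9, 0.8, -0.5]

for tau in (1.0, 0.1, 0.01):
    probs = softmax_with_temperature(scores, tau)
    print(f"tau={tau}: probs={[round(p, 4) for p in probs]}, "
          f"entropy={entropy(probs):.4f}")
```

Smaller τ concentrates probability on the highest-scoring text. In a multi-label setting, where two texts can both be correct, this is one reason the choice of τ could plausibly matter, which is what an ablation would test.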
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
