Enhancing Medical Large Vision-Language Models via Alignment Distillation
Aofei Chang, Ting Wang, Fenglong Ma

TL;DR
This paper introduces MEDALIGN, a lightweight distillation framework that improves medical vision-language models by enhancing visual alignment and interpretability, leading to more accurate and grounded clinical outputs.
Contribution
MEDALIGN is a simple, lightweight distillation method that transfers visual alignment knowledge from a domain-specific CLIP model to Med-LVLMs, targeting the misaligned visual understanding behind hallucinated outputs in medical applications.
Findings
Improves performance on medical report generation
Enhances interpretability and visual grounding
Consistently outperforms baseline models
Abstract
Medical Large Vision-Language Models (Med-LVLMs) have shown promising results in clinical applications, but often suffer from hallucinated outputs due to misaligned visual understanding. In this work, we identify two fundamental limitations contributing to this issue: insufficient visual representation learning and poor visual attention alignment. To address these problems, we propose MEDALIGN, a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. MEDALIGN introduces two distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions. Extensive experiments on medical report generation and medical visual…
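The two losses named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the spatial-aware loss is approximated as the mean squared difference between student and teacher token-to-token cosine similarity matrices, and the attention-aware loss as a KL divergence from a teacher attention distribution to the student's attention over visual tokens. All function names are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity_matrix(tokens):
    """Pairwise cosine similarity between visual token vectors (N x N)."""
    # Guard against zero-length vectors by falling back to a norm of 1.0.
    norms = [math.sqrt(sum(x * x for x in t)) or 1.0 for t in tokens]
    return [
        [sum(a * b for a, b in zip(ti, tj)) / (ni * nj)
         for tj, nj in zip(tokens, norms)]
        for ti, ni in zip(tokens, norms)
    ]


def similarity_structure_loss(student_tokens, teacher_tokens):
    """Assumed form: MSE between the two N x N similarity matrices.

    Only the number of tokens must match; the feature dimensions of
    student and teacher may differ because each matrix is N x N.
    """
    s = cosine_similarity_matrix(student_tokens)
    t = cosine_similarity_matrix(teacher_tokens)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)


def normalize(weights, eps=1e-12):
    """Turn non-negative weights into a smoothed probability distribution."""
    total = sum(weights)
    return [(w + eps) / (total + eps * len(weights)) for w in weights]


def attention_distillation_loss(student_attn, teacher_attn):
    """Assumed form: KL(teacher || student) over visual tokens."""
    p = normalize(teacher_attn)
    q = normalize(student_attn)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    # Toy inputs: 2 visual tokens, student dim 2, teacher dim 3.
    student = [[1.0, 0.0], [1.0, 0.0]]
    teacher = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
    print(similarity_structure_loss(student, teacher))
    print(attention_distillation_loss([0.9, 0.1], [0.1, 0.9]))
```

Comparing similarity structures rather than raw features is one way to distill from a teacher whose embedding dimension differs from the student's, since both matrices have one row and column per visual token. In practice such terms would be added to the model's usual training objective with weighting coefficients, which the abstract does not specify.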
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
