Enhancing Medical Large Vision-Language Models via Alignment Distillation

Aofei Chang; Ting Wang; Fenglong Ma

arXiv:2512.18554 · cs.CV · December 23, 2025

TL;DR

This paper introduces MEDALIGN, a lightweight distillation framework that improves medical vision-language models by enhancing visual alignment and interpretability, leading to more accurate and grounded clinical outputs.

Contribution

MEDALIGN is a simple, lightweight distillation method that transfers visual alignment knowledge from a domain-specific CLIP model to Med-LVLMs, targeting the hallucinated outputs that arise from misaligned visual understanding.

Findings

01. Improves performance on medical report generation
02. Enhances interpretability and visual grounding
03. Consistently outperforms baseline models

Abstract

Medical Large Vision-Language Models (Med-LVLMs) have shown promising results in clinical applications, but often suffer from hallucinated outputs due to misaligned visual understanding. In this work, we identify two fundamental limitations contributing to this issue: insufficient visual representation learning and poor visual attention alignment. To address these problems, we propose MEDALIGN, a simple, lightweight alignment distillation framework that transfers visual alignment knowledge from a domain-specific Contrastive Language-Image Pre-training (CLIP) model to Med-LVLMs. MEDALIGN introduces two distillation losses: a spatial-aware visual alignment loss based on visual token-level similarity structures, and an attention-aware distillation loss that guides attention toward diagnostically relevant regions. Extensive experiments on medical report generation and medical visual…
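The two losses named in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the formulas, so the choices below (cosine similarity for the token-level similarity structure, mean squared error between similarity matrices, KL divergence between attention distributions, and the weights `alpha` and `beta`) are assumptions made for illustration only, and all function names are hypothetical.

```python
import math


def cosine_similarity_matrix(tokens):
    """Pairwise cosine similarity between visual token vectors (list of lists)."""
    # A zero-length vector gets norm 1.0 to avoid division by zero.
    norms = [math.sqrt(sum(x * x for x in t)) or 1.0 for t in tokens]
    unit = [[x / n for x in t] for t, n in zip(tokens, norms)]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def similarity_structure_loss(student_tokens, teacher_tokens):
    """ASSUMED form: mean squared gap between token-token similarity matrices."""
    s = cosine_similarity_matrix(student_tokens)
    t = cosine_similarity_matrix(teacher_tokens)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


def normalize(weights, eps=1e-12):
    """Turn non-negative scores into a probability distribution (smoothed by eps)."""
    total = sum(weights) + eps * len(weights)
    return [(w + eps) / total for w in weights]


def attention_kl_loss(student_attn, teacher_attn):
    """ASSUMED form: KL(teacher || student) over visual-token positions."""
    p = normalize(teacher_attn)
    q = normalize(student_attn)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def combined_distillation_loss(student_tokens, teacher_tokens,
                               student_attn, teacher_attn,
                               alpha=1.0, beta=1.0):
    """Weighted sum of both terms; alpha and beta are hypothetical weights."""
    return (alpha * similarity_structure_loss(student_tokens, teacher_tokens)
            + beta * attention_kl_loss(student_attn, teacher_attn))
```

One property of comparing similarity structures rather than raw features is that the student and teacher tokens may have different dimensionality: both similarity matrices are N×N for N visual tokens, so no projection layer is needed to compare them in this sketch.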

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques