Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning
Aofei Chang, Le Huang, Alex James Boyd, Parminder Bhatia, Taha Kass-Hout, Cao Xiao, Fenglong Ma

TL;DR
This paper introduces A$^3$Tune, a fine-tuning framework that improves how medical vision-language models attend to visual inputs. It refines visually critical attention heads and adaptively selects the parameters to tune, which leads to more accurate outputs.
Contribution
The paper presents an automatic attention alignment tuning method, paired with an adaptive parameter-selection module, that improves medical vision-language models using zero-shot weak labels rather than additional manual supervision.
Findings
Outperforms state-of-the-art baselines in medical VQA and report generation.
Improves attention distribution and model performance.
Effectively refines attention heads for better visual input understanding.
Abstract
Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing mitigation methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A$^3$Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A$^3$Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A$^3$MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A$^3$Tune outperforms…
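The idea of aligning selected attention heads with a weak region label can be illustrated with a small sketch. This is not the authors' implementation: the binary patch mask below stands in for the SAM/BioMedCLIP labels, and the head-selection rule (total attention on image tokens) and the loss (negative log of attention mass inside the mask) are assumptions chosen for illustration. All function names are hypothetical.

```python
import math


def attention_mass_in_mask(attn, mask):
    """Fraction of a head's image-patch attention that falls inside the mask."""
    total = sum(attn)
    inside = sum(a for a, m in zip(attn, mask) if m)
    return inside / total if total > 0 else 0.0


def select_visual_heads(head_attn, num_image_tokens, top_k):
    """Pick the top_k heads by attention placed on image tokens.

    Assumed criterion for "visually critical"; the paper may use another.
    """
    scores = [(sum(attn[:num_image_tokens]), idx)
              for idx, attn in enumerate(head_attn)]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return sorted(idx for _, idx in scores[:top_k])


def alignment_loss(head_attn, mask, heads, eps=1e-8):
    """Mean negative log attention mass inside the mask over the chosen heads."""
    n = len(mask)  # image tokens are assumed to come first in each row
    losses = [
        -math.log(attention_mass_in_mask(head_attn[i][:n], mask) + eps)
        for i in heads
    ]
    return sum(losses) / len(losses)


# Toy input: 3 heads, each attending over 4 image patches then 2 text tokens.
head_attn = [
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.40, 0.40],
    [0.10, 0.10, 0.30, 0.30, 0.10, 0.10],
]
weak_mask = [1, 1, 0, 0]  # hypothetical label: first two patches are relevant

heads = select_visual_heads(head_attn, num_image_tokens=4, top_k=2)
loss = alignment_loss(head_attn, weak_mask, heads)
```

In a real fine-tuning setup, a loss of this kind would be computed on differentiable attention tensors and applied only to the selected heads, leaving the remaining heads untouched to limit interference.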
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling
Methods: Softmax · Attention Is All You Need · Segment Anything Model
