Focus on What Matters: Enhancing Medical Vision-Language Models with Automatic Attention Alignment Tuning

Aofei Chang; Le Huang; Alex James Boyd; Parminder Bhatia; Taha Kass-Hout; Cao Xiao; Fenglong Ma

arXiv:2505.18503 · cs.CV · May 27, 2025



TL;DR

This paper introduces A$^{3}$Tune, a fine-tuning framework that improves attention alignment in medical vision-language models by refining visually critical attention heads and adaptively selecting tuning parameters, leading to more accurate outputs.

Contribution

The paper presents an automatic attention alignment tuning method with an adaptive module (A$^{3}$MoE) that enhances medical vision-language models using zero-shot weak labels rather than extensive supervision.

Findings

01 Outperforms state-of-the-art baselines in medical VQA and report generation.

02 Improves attention distribution and model performance.

03 Effectively refines attention heads for better visual input understanding.

Abstract

Medical Large Vision-Language Models (Med-LVLMs) often exhibit suboptimal attention distribution on visual inputs, leading to hallucinated or inaccurate outputs. Existing mitigation methods primarily rely on inference-time interventions, which are limited in attention adaptation or require additional supervision. To address this, we propose A$^{3}$Tune, a novel fine-tuning framework for Automatic Attention Alignment Tuning. A$^{3}$Tune leverages zero-shot weak labels from SAM, refines them into prompt-aware labels using BioMedCLIP, and then selectively modifies visually critical attention heads to improve alignment while minimizing interference. Additionally, we introduce an A$^{3}$MoE module, enabling adaptive parameter selection for attention tuning across diverse prompts and images. Extensive experiments on medical VQA and report generation benchmarks show that A$^{3}$Tune outperforms…
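The abstract describes the pipeline only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes region masks (such as a segmenter like SAM might produce) and per-region prompt similarity scores (such as an image-text model like BioMedCLIP might produce) are already available as arrays, and it uses a generic negative-log attention-mass loss on a chosen subset of heads. The function names, the softmax weighting, and the loss form are all assumptions made for illustration; none of them come from the paper.

```python
import numpy as np

def prompt_aware_label(region_masks, region_scores):
    """Combine weak region masks into one soft, prompt-aware label.

    region_masks: (R, P) binary masks over P image patches (hypothetical input).
    region_scores: (R,) similarity of each region to the prompt (hypothetical input).
    Assumption: regions are weighted by a softmax over their scores.
    """
    masks = np.asarray(region_masks, dtype=float)
    scores = np.asarray(region_scores, dtype=float)
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ masks  # (P,) soft label with values in [0, 1]

def attention_alignment_loss(attn, label, head_ids, eps=1e-8):
    """Penalize selected heads whose attention falls outside the label.

    attn: (H, P) attention over image patches per head, each row summing to 1.
    label: (P,) soft label in [0, 1].
    head_ids: indices of the heads being tuned; the selection rule itself
        is not modeled here.
    Returns the mean of -log(attention mass inside the label) over those heads.
    """
    attn = np.asarray(attn, dtype=float)
    label = np.asarray(label, dtype=float)
    mass_in_label = attn[list(head_ids)] @ label
    return float(np.mean(-np.log(mass_in_label + eps)))

# Toy example: 2 heads, 4 patches, 2 candidate regions.
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
label = prompt_aware_label(masks, region_scores=[2.0, 0.0])
attn = np.array([[0.4, 0.4, 0.1, 0.1],   # head 0 mostly attends to region 0
                 [0.1, 0.1, 0.4, 0.4]])  # head 1 mostly attends to region 1
loss_head0 = attention_alignment_loss(attn, label, head_ids=[0])
loss_head1 = attention_alignment_loss(attn, label, head_ids=[1])
```

In this toy setup the prompt favors the first region, so the head that attends mostly to that region receives the lower loss. In a fine-tuning setting, a term of this kind would typically be added to the task loss for only the selected heads.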



Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling

Methods: Softmax · Attention Is All You Need · Segment Anything Model