Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models

Sumra Khan; Sagar Chhabriya; Aizan Zafar; Sheeraz Arif; Amgad Muneer; Anas Zafar; Shaina Raza; Rizwan Qureshi

arXiv:2604.08815 · cs.CV · April 13, 2026

TL;DR

This paper proposes a context-aligned reasoning framework for medical vision-language models that improves diagnostic accuracy and reliability by enforcing agreement across heterogeneous clinical evidence.

Contribution

The paper introduces a framework that integrates structured contextual signals to improve grounding, reduce hallucinations, and produce more trustworthy medical reasoning outputs.

Findings

01

AUC improved from 0.918 to 0.925 on chest X-ray datasets.

02

Hallucinated keywords fell from 1.14 to 0.25.

03

Reasoning explanations became more concise, shortening from 19.4 to 15.3 words.
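
To put the three reported before/after pairs on a common scale, the short sketch below computes the signed relative change for each. Only the six numbers come from the findings above; the function name `relative_change` and the dictionary keys are illustrative labels, not terms from the paper.

```python
def relative_change(before: float, after: float) -> float:
    """Signed relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Before/after values as reported in the findings; key names are illustrative.
reported = {
    "auc": (0.918, 0.925),
    "hallucinated_keywords": (1.14, 0.25),
    "explanation_words": (19.4, 15.3),
}

changes = {
    name: round(relative_change(before, after), 1)
    for name, (before, after) in reported.items()
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

On these figures the AUC gain is under one percent in relative terms, while the hallucinated-keyword count drops by roughly 78% and explanation length by roughly 21%.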

Abstract

Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate…
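
The abstract describes two ideas that a small sketch can make concrete: a structured output (supporting evidence, uncertainty estimate, limitations, safety notes) and a verification step that requires agreement across evidence sources before committing to a conclusion. The code below is a minimal illustration under stated assumptions, not the authors' implementation. The class `StructuredFinding`, the function `verify_context`, the per-source scores in [0, 1], the 0.5 threshold, the two-source agreement rule, and the mean-based uncertainty are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StructuredFinding:
    """Hypothetical structured output mirroring the fields named in the abstract."""
    conclusion: str
    supporting_evidence: List[str]
    uncertainty: float
    limitations: List[str] = field(default_factory=list)
    safety_notes: List[str] = field(default_factory=list)


def verify_context(
    candidate: str,
    signals: Dict[str, float],
    threshold: float = 0.5,
    min_agree: int = 2,
) -> StructuredFinding:
    """Accept `candidate` only if enough evidence sources support it.

    `signals` maps a source name (e.g. "radiomic") to an assumed support
    score in [0, 1]. The scoring scheme and agreement rule are assumptions.
    """
    agreeing = [name for name, score in signals.items() if score >= threshold]
    # Assumed uncertainty proxy: one minus the mean support score.
    uncertainty = 1.0 - sum(signals.values()) / len(signals)

    if len(agreeing) >= min_agree:
        return StructuredFinding(
            conclusion=candidate,
            supporting_evidence=agreeing,
            uncertainty=uncertainty,
            limitations=[n for n in signals if n not in agreeing],
        )
    # Too little agreement: withhold the conclusion rather than guess.
    return StructuredFinding(
        conclusion="inconclusive",
        supporting_evidence=agreeing,
        uncertainty=uncertainty,
        limitations=[n for n in signals if n not in agreeing],
        safety_notes=["Evidence sources disagree; defer to clinical review."],
    )


accepted = verify_context(
    "pleural effusion",
    {"radiomic": 0.8, "explainability": 0.7, "semantic": 0.3},
)
withheld = verify_context(
    "pleural effusion",
    {"radiomic": 0.8, "explainability": 0.2, "semantic": 0.3},
)
```

In this sketch a conclusion supported by only one source is withheld and flagged, which is one simple way to express the idea that signals help only when they are checked against each other.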
