Towards Responsible Multimodal Medical Reasoning via Context-Aligned Vision-Language Models
Sumra Khan, Sagar Chhabriya, Aizan Zafar, Sheeraz Arif, Amgad Muneer, Anas Zafar, Shaina Raza, Rizwan Qureshi

TL;DR
This paper proposes a context-aligned reasoning framework for medical vision-language models that improves diagnostic accuracy and reliability by enforcing agreement across heterogeneous clinical evidence.
Contribution
It introduces a novel framework that integrates structured contextual signals to enhance grounding, reduce hallucinations, and produce more trustworthy medical reasoning outputs.
Findings
AUC improved from 0.918 to 0.925 on chest X-ray datasets.
Hallucinated keywords dropped from 1.14 to 0.25.
Reasoning explanations became more concise, shrinking from 19.4 to 15.3 words.
Abstract
Medical vision-language models (VLMs) show strong performance on radiology tasks but often produce fluent yet weakly grounded conclusions due to over-reliance on a dominant modality. We introduce a context-aligned reasoning framework that enforces agreement across heterogeneous clinical evidence before generating diagnostic conclusions. The proposed approach augments a frozen VLM with structured contextual signals derived from radiomic statistics, explainability activations, and vocabulary-grounded semantic cues. Instead of producing free-form responses, the model generates structured outputs containing supporting evidence, uncertainty estimates, limitations, and safety notes. We observe that auxiliary signals alone provide limited benefit; performance gains emerge only when these signals are integrated through contextual verification. Experiments on chest X-ray datasets demonstrate…
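To make the idea of contextual verification with structured outputs more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class `StructuredReport`, the function `verify_context`, the per-source support scores, the 0.5 threshold, and the "at least two sources agree" rule are all hypothetical choices made for illustration. Only the output fields (supporting evidence, uncertainty, limitations, safety notes) and the three kinds of evidence source come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StructuredReport:
    """Hypothetical container for the output fields named in the abstract."""
    conclusion: str
    supporting_evidence: List[str] = field(default_factory=list)
    uncertainty: float = 1.0  # assumed scale: 0 = all sources agree, 1 = none do
    limitations: List[str] = field(default_factory=list)
    safety_notes: List[str] = field(default_factory=list)


def verify_context(candidate: str,
                   signals: Dict[str, Dict[str, float]],
                   threshold: float = 0.5,
                   min_agree: int = 2) -> StructuredReport:
    """Accept `candidate` only if enough evidence sources support it.

    `signals` maps a source name (e.g. "radiomics") to per-finding support
    scores in [0, 1]. The score scale, threshold, and agreement rule are
    illustrative assumptions, not the paper's method.
    """
    # Sources whose score for the candidate finding reaches the threshold.
    agreeing = [name for name, scores in signals.items()
                if scores.get(candidate, 0.0) >= threshold]
    missing = [name for name in signals if name not in agreeing]
    # Fraction of sources that do not support the candidate.
    uncertainty = 1.0 - len(agreeing) / len(signals) if signals else 1.0
    accepted = len(agreeing) >= min_agree
    return StructuredReport(
        conclusion=candidate if accepted else "inconclusive",
        supporting_evidence=agreeing,
        uncertainty=uncertainty,
        limitations=[f"no support from {name}" for name in missing],
        safety_notes=["requires clinician confirmation"],
    )


if __name__ == "__main__":
    # Made-up scores for a single image; the values carry no clinical meaning.
    example_signals = {
        "radiomics": {"effusion": 0.8},
        "explainability": {"effusion": 0.7},
        "vocabulary": {"effusion": 0.2},
    }
    print(verify_context("effusion", example_signals))
```

In this sketch a conclusion is withheld (reported as "inconclusive") whenever fewer than `min_agree` sources support it, which mirrors the abstract's point that auxiliary signals are meant to be checked against each other rather than used in isolation.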
