A Medical Multimodal Diagnostic Framework Integrating Vision-Language Models and Logic Tree Reasoning
Zelin Zang, Wenyi Gu, Siqi Ma, Dan Yang, Yue Shen, Zhu Zhang, Guohui Fan, Wing-Kuen Ling, Fuji Yang

TL;DR
This paper introduces a multimodal diagnostic framework that combines vision-language models with logic-based reasoning to improve accuracy and interpretability in medical AI diagnostics.
Contribution
It presents a novel framework integrating vision-language alignment with logic tree reasoning, enhancing trustworthiness and interpretability in multimodal medical diagnosis.
Findings
Improved diagnostic accuracy on the MedXpertQA benchmark
More interpretable reasoning traces on multimodal tasks
Competitive performance on text-only tasks
Abstract
With the rapid growth of large language models (LLMs) and vision-language models (VLMs) in medicine, simply integrating clinical text and medical imaging does not guarantee reliable reasoning. Existing multimodal models often produce hallucinations or inconsistent chains of thought, limiting clinical trust. We propose a diagnostic framework built upon LLaVA that combines vision-language alignment with logic-regularized reasoning. The system includes an input encoder for text and images, a projection module for cross-modal alignment, a reasoning controller that decomposes diagnostic tasks into steps, and a logic tree generator that assembles stepwise premises into verifiable conclusions. Evaluations on MedXpertQA and other benchmarks show that our method improves diagnostic accuracy and yields more interpretable reasoning traces on multimodal tasks, while remaining competitive on…
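To make the idea of a logic tree that "assembles stepwise premises into verifiable conclusions" more concrete, here is a minimal Python sketch. It is not the authors' implementation: the class names (`Premise`, `LogicNode`), the AND/OR operators, the boolean `holds` flag, and the example clinical statements are all hypothetical and chosen only for illustration. The abstract does not specify how premises are represented or checked.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Premise:
    """One stepwise premise (hypothetical structure, not from the paper)."""
    statement: str
    source: str   # assumed modality tag: "image" or "text"
    holds: bool   # assumed: whether an upstream step judged the premise supported


@dataclass
class LogicNode:
    """Internal tree node combining children with AND / OR (an assumption)."""
    conclusion: str
    op: str
    children: List[Union["LogicNode", Premise]] = field(default_factory=list)

    def evaluate(self) -> bool:
        if self.op not in ("AND", "OR"):
            raise ValueError(f"unsupported operator: {self.op}")
        results = [
            c.evaluate() if isinstance(c, LogicNode) else c.holds
            for c in self.children
        ]
        return all(results) if self.op == "AND" else any(results)

    def trace(self, depth: int = 0) -> List[str]:
        """Return an indented, human-readable trace of the tree."""
        pad = "  " * depth
        lines = [f"{pad}[{self.op}] {self.conclusion} -> {self.evaluate()}"]
        for child in self.children:
            if isinstance(child, LogicNode):
                lines.extend(child.trace(depth + 1))
            else:
                lines.append(
                    f"{pad}  ({child.source}) {child.statement} -> {child.holds}"
                )
        return lines


# Illustrative example only; the findings below are made up.
tree = LogicNode(
    "Findings are consistent with pneumonia",
    "AND",
    [
        Premise("Chest X-ray shows lobar consolidation", "image", True),
        LogicNode(
            "Clinical signs of infection are present",
            "OR",
            [
                Premise("Fever is reported in the history", "text", True),
                Premise("White cell count is elevated", "text", False),
            ],
        ),
    ],
)

print("\n".join(tree.trace()))
```

In this sketch, every conclusion can be traced back to the premises that support it, and changing a single premise changes the result in a way a reader can inspect. That is the interpretability property the abstract attributes to logic tree reasoning, shown here in a heavily simplified form.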
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
