Computed Tomography Visual Question Answering with Cross-modal Feature Graphing
Yuanhe Tian, Chen Su, Junwen Duan, Yan Song

TL;DR
This paper introduces a novel CT VQA framework that uses a cross-modal feature graph and graph convolutional networks to improve reasoning accuracy in medical image question answering.
Contribution
It proposes a cross-modal graph representation and an attentive graph convolutional network to better capture spatial and inter-slice relationships in CT data for VQA.
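To make the idea of an attention-weighted graph convolution over a cross-modal graph concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the node layout (four CT-slice nodes plus three question-token nodes), the scaled dot-product attention, the adjacency pattern, and every name (`attentive_gcn_layer`, `h`, `adj`, `w`) are assumptions chosen only for illustration.

```python
import numpy as np


def attentive_gcn_layer(h, adj, w):
    """One attention-weighted graph convolution step (illustrative only).

    h:   (n, d) node features, e.g. CT-slice and question-token features
    adj: (n, n) 0/1 adjacency matrix, assumed to include self-loops
    w:   (d, d_out) projection matrix
    Returns the updated node features and the attention weights.
    """
    # Scaled dot-product similarity between every pair of nodes.
    scores = h @ h.T / np.sqrt(h.shape[1])
    # Keep scores only where an edge exists; non-edges get zero weight.
    scores = np.where(adj > 0, scores, -np.inf)
    # Row-wise softmax (max-subtraction for numerical stability).
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Aggregate projected neighbour features, then apply ReLU.
    out = np.maximum(weights @ (h @ w), 0.0)
    return out, weights


rng = np.random.default_rng(0)
slice_feats = rng.normal(size=(4, 8))   # hypothetical per-slice visual features
token_feats = rng.normal(size=(3, 8))   # hypothetical question-token features
h = np.vstack([slice_feats, token_feats])

adj = np.eye(7)                         # self-loops
for i in range(3):                      # link neighbouring slices
    adj[i, i + 1] = adj[i + 1, i] = 1
adj[4:, :4] = 1                         # link every token to every slice
adj[:4, 4:] = 1

w = rng.normal(size=(8, 5))
out, weights = attentive_gcn_layer(h, adj, w)
print(out.shape)
```

Masking with the adjacency matrix before the softmax is what restricts each node to its graph neighbours; in this toy graph, slice nodes attend to adjacent slices and to all question tokens, while question tokens attend only to slices and to themselves.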
Findings
Outperforms baseline methods on the M3D-VQA benchmark
Enhances reasoning capabilities in CT VQA tasks
Achieves more accurate and robust answers
Abstract
Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual encoders to independently extract features from medical images and clinical questions, which are subsequently combined to generate answers. Specifically, in computed tomography (CT), such approaches are similar to the conventional practices in medical image analysis. However, these approaches pay less attention to the spatial continuity and inter-slice correlations in the volumetric CT data, leading to fragmented and imprecise responses. In this paper, we propose a novel large language model (LLM)-based framework enhanced by a graph representation of salient features. Different from conventional multimodal encoding strategies, our approach constructs…
