Computed Tomography Visual Question Answering with Cross-modal Feature Graphing

Yuanhe Tian; Chen Su; Junwen Duan; Yan Song

arXiv:2507.04333 · cs.CV · July 8, 2025

TL;DR

This paper introduces a novel CT VQA framework that uses a cross-modal feature graph and graph convolutional networks to improve reasoning accuracy in medical image question answering.

Contribution

It proposes a cross-modal graph representation and attentive graph convolutional network to better capture spatial and inter-slice relationships in CT data for VQA.
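
To make the idea of attention-weighted message passing over a cross-modal graph concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the graph layout (slice nodes linked to neighboring slices and to every question-token node), the dot-product attention, and the names build_adjacency and attentive_gcn_layer are all assumptions chosen for illustration.

```python
import math


def build_adjacency(num_slices, num_tokens):
    """Hypothetical layout: slice nodes come first, then question-token nodes."""
    n = num_slices + num_tokens
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1  # self-loop so each node keeps its own feature
    for s in range(num_slices - 1):
        adj[s][s + 1] = adj[s + 1][s] = 1  # neighboring CT slices (inter-slice edge)
    for s in range(num_slices):
        for t in range(num_slices, n):
            adj[s][t] = adj[t][s] = 1  # slice-to-token edge (cross-modal)
    return adj


def attentive_gcn_layer(features, adj, weight):
    """One attention-weighted graph convolution step followed by ReLU.

    features: list of node vectors; weight: out_dim x in_dim matrix (list of rows).
    Attention here is a softmax over dot products, which is an assumption.
    """
    out = []
    for i, h_i in enumerate(features):
        neighbors = [j for j, connected in enumerate(adj[i]) if connected]
        scores = [sum(a * b for a, b in zip(h_i, features[j])) for j in neighbors]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        attn = [e / total for e in exps]
        # Attention-weighted sum of neighbor features.
        agg = [
            sum(a * features[j][d] for a, j in zip(attn, neighbors))
            for d in range(len(h_i))
        ]
        projected = [sum(w * x for w, x in zip(row, agg)) for row in weight]
        out.append([max(0.0, v) for v in projected])  # ReLU
    return out
```

In this sketch, each updated node feature mixes information from adjacent slices and from the question tokens, weighted by how similar the neighbor is to the node itself. A real system would use learned encoders and trainable weights in place of the toy vectors.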

Findings

01 Outperforms baseline methods on M3D-VQA benchmark

02 Enhances reasoning capabilities in CT VQA tasks

03 Achieves more accurate and robust answers

Abstract

Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual encoders to independently extract features from medical images and clinical questions, which are subsequently combined to generate answers. Specifically, in computed tomography (CT), such approaches are similar to the conventional practices in medical image analysis. However, these approaches pay less attention to the spatial continuity and inter-slice correlations in the volumetric CT data, leading to fragmented and imprecise responses. In this paper, we propose a novel large language model (LLM)-based framework enhanced by a graph representation of salient features. Different from conventional multimodal encoding strategies, our approach constructs…
