PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery
Runlong He, Mengya Xu, Adrito Das, Danyal Z. Khan, Sophia Bano, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarakol Islam

TL;DR
This paper introduces PitVQA, a specialized dataset and a novel image-grounded text embedding model for visual question answering in pituitary surgery, enhancing intra-operative decision support with improved accuracy.
Contribution
The paper presents a new surgical VQA dataset and a novel image-grounded text embedding model that effectively aligns image and text features for better surgical question answering.
Findings
Achieved accuracy improvements of 8% and 9% on the PitVQA and EndoVis18-VQA datasets, respectively.
Developed a joint embedding and cross-attention based image-grounded text model.
Enhanced intra-operative decision-making through improved VQA performance.
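To make the cross-attention idea in the findings concrete, here is a minimal sketch of how text token embeddings can be grounded in image patch features. It is generic scaled dot-product cross-attention in plain Python, not PitVQA-Net's actual implementation. The function names, the absence of learned projection matrices, and the residual addition are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens, image_patches):
    """Ground each text token in the image patches.

    text_tokens:   list of query vectors, one per text token
    image_patches: list of vectors used as both keys and values
    Both must share the same dimensionality. Learned projections are
    omitted here (assumption: identity projections) to keep the sketch short.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    grounded = []
    for query in text_tokens:
        # Similarity of this text token to every image patch.
        scores = [
            scale * sum(q * k for q, k in zip(query, patch))
            for patch in image_patches
        ]
        weights = softmax(scores)
        # Weighted average of the patches: the visual context for this token.
        context = [
            sum(w * patch[d] for w, patch in zip(weights, image_patches))
            for d in range(dim)
        ]
        # Residual connection (assumption): keep the text signal and add
        # the visual context on top of it.
        grounded.append([q + c for q, c in zip(query, context)])
    return grounded


if __name__ == "__main__":
    text = [[1.0, 0.0], [0.0, 1.0]]
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for vec in cross_attention(text, patches):
        print(vec)
```

In a real model the queries, keys, and values would pass through learned linear layers and multiple attention heads before the grounded embeddings are fed to the language model; the sketch only shows the direction of attention, with text as queries and image patches as keys and values.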
Abstract
Visual Question Answering (VQA) within the surgical domain, utilizing Large Language Models (LLMs), offers a distinct opportunity to improve intra-operative decision-making and facilitate intuitive surgeon-AI interaction. However, the development of LLMs for surgical VQA is hindered by the scarcity of diverse and extensive datasets with complex reasoning tasks. Moreover, contextual fusion of the image and text modalities remains an open research challenge due to the inherent differences between these two types of information and the complexity involved in aligning them. This paper introduces PitVQA, a novel dataset specifically designed for VQA in endonasal pituitary surgery and PitVQA-Net, an adaptation of the GPT2 with a novel image-grounded text embedding for surgical VQA. PitVQA comprises 25 procedural videos and a rich collection of question-answer pairs spanning crucial surgical…
Taxonomy
Topics: Multimodal Machine Learning Applications
