PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery

Runlong He; Mengya Xu; Adrito Das; Danyal Z. Khan; Sophia Bano; Hani J. Marcus; Danail Stoyanov; Matthew J. Clarkson; Mobarakol Islam

arXiv:2405.13949 · cs.CV · May 24, 2024

TL;DR

This paper introduces PitVQA, a specialized dataset and a novel image-grounded text embedding model for visual question answering in pituitary surgery, enhancing intra-operative decision support with improved accuracy.

Contribution

The paper presents a new surgical VQA dataset and a novel image-grounded text embedding model that effectively aligns image and text features for better surgical question answering.

Findings

01. Achieved accuracy improvements of 8% and 9% on the PitVQA and EndoVis18-VQA datasets, respectively.

02. Developed an image-grounded text model based on joint embedding and cross-attention.

03. Enhanced intra-operative decision-making through improved VQA performance.
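
The cross-attention idea named in the findings can be illustrated with a minimal sketch. This is not the authors' PitVQA-Net: it assumes text token embeddings act as queries and image patch features act as keys and values, omits the learned projection matrices a real model would use, and runs on made-up toy vectors. The function names `softmax` and `cross_attention` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_tokens, image_patches):
    """Ground each text token in the image patches.

    Queries come from the text; keys and values come from the image.
    Learned projections are omitted (an assumption to keep the sketch
    short), so both inputs must share the same feature dimension.
    """
    dim = len(text_tokens[0])
    scale = math.sqrt(dim)
    grounded, all_weights = [], []
    for query in text_tokens:
        # Scaled dot-product score between this token and every patch.
        scores = [sum(q * k for q, k in zip(query, patch)) / scale
                  for patch in image_patches]
        weights = softmax(scores)
        # Weighted sum of patch features: the token's visual context.
        attended = [sum(w * patch[i] for w, patch in zip(weights, image_patches))
                    for i in range(dim)]
        # Residual connection: keep the token and add its visual context.
        grounded.append([t + a for t, a in zip(query, attended)])
        all_weights.append(weights)
    return grounded, all_weights

# Toy inputs (hypothetical values): 2 text tokens, 3 image patches, dimension 4.
text_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
image_patches = [[4.0, 0.0, 0.0, 0.0],
                 [0.0, 4.0, 0.0, 0.0],
                 [0.0, 0.0, 4.0, 0.0]]
grounded, weights = cross_attention(text_tokens, image_patches)
```

Each row of `weights` is a distribution over image patches for one text token, so the grounded embedding for a token is pulled toward the patches most similar to it. In a full model the grounded embeddings would then be passed on to the language model; that step is outside this sketch.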

Abstract

Visual Question Answering (VQA) within the surgical domain, utilizing Large Language Models (LLMs), offers a distinct opportunity to improve intra-operative decision-making and facilitate intuitive surgeon-AI interaction. However, the development of LLMs for surgical VQA is hindered by the scarcity of diverse and extensive datasets with complex reasoning tasks. Moreover, contextual fusion of the image and text modalities remains an open research challenge due to the inherent differences between these two types of information and the complexity involved in aligning them. This paper introduces PitVQA, a novel dataset specifically designed for VQA in endonasal pituitary surgery and PitVQA-Net, an adaptation of the GPT2 with a novel image-grounded text embedding for surgical VQA. PitVQA comprises 25 procedural videos and a rich collection of question-answer pairs spanning crucial surgical…

Code & Models

Repositories

mobarakol/pitvqa (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications