Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
Long Bai, Mobarakol Islam, Lalithkumar Seenivasan, Hongliang Ren

TL;DR
This paper introduces Surgical-VQLA, a novel transformer-based system for localized visual question answering in robotic surgery videos, improving scene understanding and localization accuracy without relying on object detectors.
Contribution
It proposes a gated vision-language embedding and a detection head within a transformer framework for localized surgical question-answering, addressing dataset scarcity and modality fusion issues.
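The gating idea can be illustrated with a minimal sketch. It assumes a generic gated multimodal fusion, in which a sigmoid gate computed from the concatenated visual and text features mixes the two tanh-projected modalities; this is not necessarily the paper's exact formulation. All names (`gated_fusion`, `W_v`, `W_t`, `W_g`), the dimensions, and the random weights are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    # Hypothetical stand-in for learned weights.
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def gated_fusion(visual, text, W_v, W_t, W_g):
    """Fuse one visual and one text feature vector with a learned gate.

    Assumed form (illustrative, not taken from the paper):
        h_v  = tanh(W_v @ visual)
        h_t  = tanh(W_t @ text)
        gate = sigmoid(W_g @ [visual ; text])
        out  = gate * h_v + (1 - gate) * h_t
    """
    h_v = [math.tanh(a) for a in matvec(W_v, visual)]
    h_t = [math.tanh(a) for a in matvec(W_t, text)]
    gate = [sigmoid(a) for a in matvec(W_g, visual + text)]
    fused = [g * hv + (1.0 - g) * ht for g, hv, ht in zip(gate, h_v, h_t)]
    return fused, gate


rng = random.Random(0)
d_v, d_t, d = 4, 3, 5  # hypothetical feature sizes
W_v = random_matrix(d, d_v, rng)
W_t = random_matrix(d, d_t, rng)
W_g = random_matrix(d, d_v + d_t, rng)

visual = [0.2, -0.1, 0.7, 0.4]  # e.g. a pooled image feature
text = [0.5, 0.3, -0.6]         # e.g. a question token embedding
fused, gate = gated_fusion(visual, text, W_v, W_t, W_g)
```

Each gate value lies strictly between 0 and 1, so every fused dimension is a convex combination of the two projected modalities. This lets the model weight image and text evidence per dimension instead of simply concatenating or adding them.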
Findings
Outperforms existing methods on surgical VQA benchmarks.
Effectively localizes surgical areas related to questions.
Enhances understanding of complex surgical scenes.
Abstract
Despite the availability of computer-aided simulators and recorded videos of surgical procedures, junior residents still rely heavily on experts to answer their queries. However, expert surgeons are often overloaded with clinical and academic workloads and have limited time to answer. For this purpose, we develop a surgical question-answering system to facilitate robot-assisted surgical scene and activity understanding from recorded videos. Most existing VQA methods require an object detector and a region-based feature extractor to extract visual features and fuse them with the embedded text of the question for answer generation. However, (1) surgical object detection models are scarce due to smaller datasets and a lack of bounding box annotations; (2) the current fusion strategy for heterogeneous modalities like text and image is naive; (3) localized answering is missing, which is…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Absolute Position Encodings · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer
