Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Long Bai; Mobarakol Islam; Lalithkumar Seenivasan; Hongliang Ren

arXiv:2305.11692 · cs.CV · May 22, 2023 · 1 citation

TL;DR

This paper introduces Surgical-VQLA, a novel transformer-based system for localized visual question answering in robotic surgery videos, improving scene understanding and localization accuracy without relying on object detectors.

Contribution

It proposes a gated vision-language embedding and a detection head within a transformer framework for localized surgical question-answering, addressing dataset scarcity and modality fusion issues.
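
As a rough illustration of what a gated fusion of visual and text features followed by a localization head can look like, here is a minimal pure-Python sketch. It assumes a generic gated-multimodal-unit style formulation (a sigmoid gate blending tanh-projected features) and a single linear box head with sigmoid outputs; the paper's actual embedding and detection head may differ. All names (`gated_fusion`, `box_head`, `random_layer`), the dimensions, and the random weights are hypothetical and chosen only for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, bias, x):
    """Compute W @ x + b for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def random_layer(rng, out_dim, in_dim):
    """Hypothetical helper: an untrained linear layer with random weights."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def gated_fusion(visual, text, params):
    """Blend one visual and one text feature vector with a learned gate.

    Assumed formulation (not taken from the paper):
        h_v = tanh(W_v v), h_t = tanh(W_t t)
        g   = sigmoid(W_g [v; t])
        out = g * h_v + (1 - g) * h_t
    """
    (wv, bv), (wt, bt), (wg, bg) = params
    h_visual = [math.tanh(z) for z in linear(wv, bv, visual)]
    h_text = [math.tanh(z) for z in linear(wt, bt, text)]
    gate = [sigmoid(z) for z in linear(wg, bg, visual + text)]  # concatenation
    return [g * hv + (1.0 - g) * ht
            for g, hv, ht in zip(gate, h_visual, h_text)]


def box_head(fused, params):
    """Map a fused feature to a normalized (cx, cy, w, h) box in (0, 1)."""
    weights, bias = params
    return [sigmoid(z) for z in linear(weights, bias, fused)]


rng = random.Random(0)
dim_v, dim_t, dim_h = 6, 4, 5  # illustrative sizes only

fusion_params = (
    random_layer(rng, dim_h, dim_v),          # visual projection
    random_layer(rng, dim_h, dim_t),          # text projection
    random_layer(rng, dim_h, dim_v + dim_t),  # gate over both modalities
)
head_params = random_layer(rng, 4, dim_h)

visual = [rng.uniform(-1, 1) for _ in range(dim_v)]
text = [rng.uniform(-1, 1) for _ in range(dim_t)]

fused = gated_fusion(visual, text, fusion_params)
box = box_head(fused, head_params)
print("fused feature:", [round(f, 3) for f in fused])
print("predicted box (cx, cy, w, h):", [round(b, 3) for b in box])
```

The gate lets each output dimension lean on whichever modality is more informative, in contrast to plain concatenation or summation. In a full model the fused embedding would pass through transformer layers before reaching the answer classifier and the box head, and all weights would be learned rather than random.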

Findings

01. Outperforms existing methods on surgical VQA tasks.

02. Effectively localizes the surgical areas relevant to each question.

03. Enhances understanding of complex surgical scenes.

Abstract

Despite the availability of computer-aided simulators and recorded videos of surgical procedures, junior residents still rely heavily on experts to answer their queries. However, expert surgeons are often overloaded with clinical and academic work and have limited time to answer. For this purpose, we develop a surgical question-answering system to facilitate robot-assisted surgical scene and activity understanding from recorded videos. Most existing VQA methods require an object detector and a region-based feature extractor to extract visual features and fuse them with the embedded text of the question for answer generation. However, (1) surgical object detection models are scarce due to small datasets and a lack of bounding box annotations; (2) current fusion strategies for heterogeneous modalities like text and image are naive; (3) localized answering is missing, which is…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI

Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Adam · Absolute Position Encodings · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer