CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Long Bai; Mobarakol Islam; Hongliang Ren

arXiv:2307.05182·cs.CV·August 22, 2023

TL;DR

This paper introduces CAT-ViL, an end-to-end Transformer-based system for surgical visual question localized-answering. It fuses vision and language features to answer questions about surgical scenes and to localize each answer in the surgical video.

Contribution

The paper proposes a Co-Attention Gated Vision-Language (CAT-ViL) embedding module integrated with a Transformer for surgical VQLA. The design does not require feature extraction through a detection model.
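To make the idea of a gated vision-language embedding concrete, here is a minimal, dependency-free sketch of a generic gating step that blends a visual feature vector with a text feature vector. It is an illustration under stated assumptions, not the authors' module: the function name `gated_fusion`, the toy dimensions, and the random weights are all hypothetical, and the co-attention step that the paper applies is omitted entirely.

```python
import math
import random


def sigmoid(x):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, text, weights, bias):
    """Blend two equal-length feature vectors with an element-wise gate.

    Hypothetical illustration (not the paper's code):
        gate  = sigmoid(W @ [visual; text] + b)
        fused = gate * visual + (1 - gate) * text
    `weights` has one row per output element, each row of length 2 * dim.
    """
    if len(visual) != len(text):
        raise ValueError("visual and text features must have the same length")
    concat = list(visual) + list(text)
    fused = []
    for row, b, v, t in zip(weights, bias, visual, text):
        gate = sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        # Convex combination: gate near 1 keeps the visual feature,
        # gate near 0 keeps the text feature.
        fused.append(gate * v + (1.0 - gate) * t)
    return fused


# Toy inputs; every value here is made up for the example.
dim = 4
rng = random.Random(0)
visual = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
text = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
weights = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
bias = [0.0] * dim

fused = gated_fusion(visual, text, weights, bias)
print(fused)
```

Because the gate lies strictly between 0 and 1, each fused element is a weighted average of the corresponding visual and text elements. In a trained model the gate weights would be learned; the official PyTorch repository listed under Code & Models is the place to check how the actual module is built.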

Findings

01 Outperforms state-of-the-art methods on MICCAI EndoVis datasets

02 Demonstrates robustness and accuracy in surgical scene understanding

03 Ablation studies confirm effectiveness of proposed components

Abstract

Medical students and junior surgeons often rely on senior surgeons and specialists to answer their questions when learning surgery. However, experts are often busy with clinical and academic work and have little time to give guidance. Meanwhile, existing deep learning (DL)-based surgical Visual Question Answering (VQA) systems can only provide simple answers without the location of the answers. In addition, vision-language (ViL) embedding remains a less-explored research area in these kinds of tasks. Therefore, a surgical Visual Question Localized-Answering (VQLA) system would help medical students and junior surgeons learn from and understand recorded surgical videos. We propose an end-to-end Transformer with the Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which does not require feature extraction through detection models. The…

Code & Models

Repositories

longbai1006/cat-vil (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Absolute Position Encodings · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing