CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
Long Bai, Mobarakol Islam, Hongliang Ren

TL;DR
This paper introduces CAT-ViL, an end-to-end Transformer-based system for surgical visual question localized-answering (VQLA) that fuses vision and language features to answer questions about surgical videos and to localize each answer in the scene.
Contribution
The paper proposes a new Co-Attention Gated Vision-Language embedding module integrated with a Transformer for surgical VQLA, eliminating the need for detection models and enhancing performance.
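As a rough illustration of what a co-attention gated fusion of visual and text features could look like, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the mean pooling, the single gate layer, and all shapes are assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention in which one modality attends to the other.

    queries:     (n_q, d) tokens of the attending modality
    keys_values: (n_k, d) tokens of the attended modality
    returns:     (n_q, d) context vectors
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


def gated_fusion(visual, text, w_gate, b_gate):
    """Hypothetical co-attention + gate fusion (illustrative assumption).

    visual: (n_vis, d) visual tokens, text: (n_txt, d) text tokens
    w_gate: (d, 2d) gate weights, b_gate: (d,) gate bias
    returns: (d,) fused embedding
    """
    # Co-attention: each modality is re-expressed in terms of the other.
    vis_ctx = cross_attention(visual, text)
    txt_ctx = cross_attention(text, visual)
    # Mean pooling to one vector per modality (a simplification of this sketch).
    v = vis_ctx.mean(axis=0)
    t = txt_ctx.mean(axis=0)
    # Sigmoid gate decides, per feature, how much of each modality to keep.
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ np.concatenate([v, t]) + b_gate)))
    return gate * v + (1.0 - gate) * t


rng = np.random.default_rng(0)
d = 8
visual_tokens = rng.normal(size=(5, d))   # stand-in for image patch features
text_tokens = rng.normal(size=(3, d))     # stand-in for question token features
fused = gated_fusion(visual_tokens, text_tokens,
                     rng.normal(size=(d, 2 * d)), np.zeros(d))
print(fused.shape)
```

In a full system the fused embedding would feed a Transformer with answer and localization heads. Those stages are omitted here because this summary does not describe them in detail.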
Findings
Outperforms state-of-the-art methods on MICCAI EndoVis datasets
Demonstrates robustness and accuracy in surgical scene understanding
Ablation studies confirm effectiveness of proposed components
Abstract
Medical students and junior surgeons often rely on senior surgeons and specialists to answer their questions when learning surgery. However, experts are often busy with clinical and academic work, and have little time to give guidance. Meanwhile, existing deep learning (DL)-based surgical Visual Question Answering (VQA) systems can only provide simple answers without the location of the answers. In addition, vision-language (ViL) embedding remains a less explored research topic in these kinds of tasks. Therefore, a surgical Visual Question Localized-Answering (VQLA) system would be helpful for medical students and junior surgeons to learn and understand from recorded surgical videos. We propose an end-to-end Transformer with the Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios, which does not require feature extraction through detection models. The…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Absolute Position Encodings · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Label Smoothing
