Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches

Mengya Xu; Mobarakol Islam; Hongliang Ren

arXiv:2207.00113 · cs.CV · July 4, 2022

TL;DR

This paper introduces SwinMLP-TranCAP, an end-to-end, efficient surgical captioning model that eliminates the need for heavy detectors or feature extractors, enabling faster inference in real-time robotic surgery.

Contribution

The paper proposes a novel patch-based shifted-window MLP transformer for surgical captioning, extends it to video captioning, and demonstrates competitive performance with a simplified architecture.

Findings

01

Faster inference speed and reduced computation compared to previous models.

02

Maintains performance on surgical datasets despite architectural simplification.

03

Extended to video captioning using 3D patches and windows.
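
As a rough illustration of what 3D patches and windows mean here, the NumPy sketch below cuts a video clip into non-overlapping spatio-temporal patches and then groups the resulting tokens into 3D windows. The function names, the patch size, and the window size are assumptions made for illustration; they are not taken from the paper or its repository.

```python
import numpy as np

def patchify_3d(video, patch):
    """Split a (T, H, W, C) clip into non-overlapping 3D patches.

    Returns a token grid of shape (T//pt, H//ph, W//pw, pt*ph*pw*C).
    Hypothetical helper; frames/pixels that do not fill a patch are dropped.
    """
    T, H, W, C = video.shape
    pt, ph, pw = patch
    gt, gh, gw = T // pt, H // ph, W // pw
    x = video[:gt * pt, :gh * ph, :gw * pw]
    x = x.reshape(gt, pt, gh, ph, gw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)  # bring the grid axes to the front
    return x.reshape(gt, gh, gw, pt * ph * pw * C)

def window_partition_3d(tokens, window):
    """Group a (gt, gh, gw, D) token grid into 3D windows.

    Returns (num_windows, wt*wh*ww, D). Assumes the grid is divisible
    by the window size along every axis.
    """
    gt, gh, gw, D = tokens.shape
    wt, wh, ww = window
    x = tokens.reshape(gt // wt, wt, gh // wh, wh, gw // ww, ww, D)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, wt * wh * ww, D)
```

For example, a clip of shape (4, 8, 8, 3) with patch size (2, 4, 4) yields a 2 x 2 x 2 grid of 96-dimensional tokens, which a (2, 2, 2) window groups into a single window of 8 tokens.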

Abstract

Surgical captioning plays an important role in surgical instruction prediction and report generation. However, most captioning models still rely on a computationally heavy object detector or feature extractor to extract regional features. In addition, the detection model requires bounding box annotations, which are costly and need skilled annotators. These requirements delay inference and limit the deployment of captioning models in real-time robotic surgery. To address this, we design an end-to-end captioning model, free of both detector and feature extractor, using the patch-based shifted window technique. We propose the Shifted Window-Based Multi-Layer Perceptrons Transformer Captioning model (SwinMLP-TranCAP), which offers faster inference and less computation. SwinMLP-TranCAP replaces the multi-head attention module with a window-based multi-head MLP. Such deployments primarily…
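
To make the idea of a window-based multi-head MLP more concrete, here is a minimal NumPy sketch of the data flow: an image is cut into patches, the patch tokens are grouped into local windows, and each head mixes the tokens of a window with its own weight matrix instead of computing attention. This is a sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the learned weights are replaced by arrays passed in by the caller, and the window shift is approximated with a cyclic roll.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into a (H//p, W//p, p*p*C) grid of patch tokens."""
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh, gw, patch * patch * C)

def shift_tokens(tokens, ws):
    """Cyclically shift the token grid by half a window (a simplification)."""
    return np.roll(tokens, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))

def window_partition(tokens, ws):
    """Group a (gh, gw, D) token grid into (num_windows, ws*ws, D) windows."""
    gh, gw, D = tokens.shape
    x = tokens.reshape(gh // ws, ws, gw // ws, ws, D).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, ws * ws, D)

def window_multihead_mlp(windows, weights):
    """Mix tokens inside each window, one weight matrix per head.

    windows: (num_windows, T, D) with T = ws*ws tokens per window
    weights: (heads, T, T); head h mixes tokens on its own D//heads channels
    """
    nW, T, D = windows.shape
    heads = weights.shape[0]
    x = windows.reshape(nW, T, heads, D // heads)
    # out[n, t, h, d] = sum over s of weights[h, t, s] * x[n, s, h, d]
    out = np.einsum("hts,nshd->nthd", weights, x)
    return out.reshape(nW, T, D)
```

Unlike attention, the mixing weights here do not depend on the input, so the per-window cost is a fixed matrix product. That is the kind of simplification the abstract refers to when it mentions less computation.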

Code & Models

Repositories

xumengyaamy/swinmlp_trancap
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Cancer-related molecular mechanisms research

Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Residual Connection · Dense Connections · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing