Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches
Mengya Xu, Mobarakol Islam, Hongliang Ren

TL;DR
This paper introduces SwinMLP-TranCAP, an end-to-end, efficient surgical captioning model that eliminates the need for heavy detectors or feature extractors, enabling faster inference in real-time robotic surgery.
Contribution
The paper proposes a novel patch-based shifted-window MLP transformer for surgical captioning, extends it to video captioning, and demonstrates competitive performance with a simplified architecture.
Findings
Faster inference speed and reduced computation compared to previous models.
Maintains performance on surgical datasets despite architectural simplification.
Extended to video captioning using 3D patches and windows.
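To make the "3D patches" idea concrete, here is a minimal sketch of how a short video clip could be cut into non-overlapping 3D patches, each flattened into one token. The function name `partition_3d_patches`, the single-channel toy clip, and the patch size (2, 2, 2) are illustrative assumptions, not values taken from the paper.

```python
from typing import List, Tuple

Clip = List[List[List[float]]]  # indexed as clip[t][y][x]; single channel for brevity


def partition_3d_patches(clip: Clip, patch: Tuple[int, int, int]) -> List[List[float]]:
    """Split a (T, H, W) clip into non-overlapping 3D patches, each flattened to one token."""
    pt, ph, pw = patch
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("clip dimensions must be divisible by the patch size")
    tokens = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                # Gather every value inside the pt x ph x pw block starting at (t0, y0, x0).
                tokens.append([
                    clip[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ])
    return tokens


# Toy clip (hypothetical): 4 frames of 4x4 pixels; each value encodes its (t, y, x) position.
clip = [[[t * 100 + y * 10 + x for x in range(4)] for y in range(4)] for t in range(4)]
tokens = partition_3d_patches(clip, patch=(2, 2, 2))
print(len(tokens), len(tokens[0]))
```

In a real model each flattened patch would then pass through a learned linear embedding. That step is omitted here because the summary does not give its dimensions.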
Abstract
Surgical captioning plays an important role in surgical instruction prediction and report generation. However, the majority of captioning models still rely on a computationally heavy object detector or feature extractor to extract regional features. In addition, the detection model requires additional bounding box annotation, which is costly and needs skilled annotators. These requirements lead to inference delay and limit the deployment of captioning models in real-time robotic surgery. To address this, we design an end-to-end detector and feature extractor-free captioning model by utilizing the patch-based shifted window technique. We propose the Shifted Window-Based Multi-Layer Perceptrons Transformer Captioning model (SwinMLP-TranCAP) with faster inference speed and less computation. SwinMLP-TranCAP replaces the multi-head attention module with window-based multi-head MLP. Such deployments primarily…
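The shifted-window and window-based multi-head MLP ideas named in the abstract can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the window shift is done with a cyclic roll (the Swin Transformer convention; the actual model may handle borders differently, for example with padding), no masking is applied, and each head mixes tokens inside a window with its own fixed N x N weight matrix. All function names and the toy weights are hypothetical.

```python
from typing import List

Grid = List[List[List[float]]]  # grid[y][x] is one token's channel vector


def cyclic_shift(grid: Grid, shift: int) -> Grid:
    """Roll the token grid up and left by `shift` positions (assumed shifted-window step)."""
    H, W = len(grid), len(grid[0])
    return [[grid[(y + shift) % H][(x + shift) % W] for x in range(W)] for y in range(H)]


def window_partition(grid: Grid, ws: int) -> List[List[List[float]]]:
    """Split the grid into non-overlapping ws x ws windows, each a list of ws*ws tokens."""
    H, W = len(grid), len(grid[0])
    windows = []
    for y0 in range(0, H, ws):
        for x0 in range(0, W, ws):
            windows.append([grid[y][x] for y in range(y0, y0 + ws) for x in range(x0, x0 + ws)])
    return windows


def multi_head_window_mlp(window: List[List[float]],
                          weights: List[List[List[float]]]) -> List[List[float]]:
    """Mix tokens inside one window using one (N x N) weight matrix per head.

    window:  N tokens, each with C channels (C divisible by the number of heads).
    weights: weights[h][i][j] is the contribution of token j to token i in head h.
    """
    n_tokens, n_channels = len(window), len(window[0])
    head_dim = n_channels // len(weights)
    out = [[0.0] * n_channels for _ in range(n_tokens)]
    for h, w_h in enumerate(weights):
        # Head h only touches its own slice of channels.
        for c in range(h * head_dim, (h + 1) * head_dim):
            for i in range(n_tokens):
                out[i][c] = sum(w_h[i][j] * window[j][c] for j in range(n_tokens))
    return out


# Toy 4x4 token grid with 2 channels; both channels hold the token's flat index.
grid = [[[float(y * 4 + x)] * 2 for x in range(4)] for y in range(4)]
windows = window_partition(cyclic_shift(grid, 1), ws=2)

# Hypothetical weights: head 0 averages the 4 tokens, head 1 is the identity.
average = [[0.25] * 4 for _ in range(4)]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
mixed = multi_head_window_mlp(windows[0], [average, identity])
print(mixed)
```

Unlike attention, the mixing weights here do not depend on the input, which is one reason such a layer needs less computation: there are no query-key products or softmax to evaluate per window.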
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Cancer-related molecular mechanisms research
Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Residual Connection · Dense Connections · Byte Pair Encoding · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing
