MuST: Multi-Scale Transformers for Surgical Phase Recognition

Alejandra Pérez, Santiago Rodríguez, Nicolás Ayobi, Nicolás Aparicio, Eugénie Dessevres, Pablo Arbeláez

arXiv:2407.17361 · cs.CV · July 25, 2024

TL;DR

MuST is a multi-scale Transformer approach to surgical phase recognition. It captures short-, mid-, and long-term information in surgical videos and outperforms the previous state of the art on three benchmarks.

Contribution

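The paper presents MuST, a Transformer-based model that pairs a Multi-Term Frame Encoder, which samples sequences at increasing strides around the frame of interest, with a Temporal Consistency Module, combining multi-scale temporal sampling with long-term reasoning for surgical phase recognition.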

Findings

01 Outperforms previous state-of-the-art on three benchmarks

02 Effectively captures multi-scale temporal dependencies

03 Enhances long-term reasoning in surgical videos

Abstract

Phase recognition in surgical videos is crucial for enhancing computer-aided surgical systems as it enables automated understanding of sequential procedural stages. Existing methods often rely on fixed temporal windows for video analysis to identify dynamic surgical phases. Thus, they struggle to simultaneously capture the short-, mid-, and long-term information necessary to fully understand complex surgical procedures. To address these issues, we propose Multi-Scale Transformers for Surgical Phase Recognition (MuST), a novel Transformer-based approach that combines a Multi-Term Frame Encoder with a Temporal Consistency Module to capture information across multiple temporal scales of a surgical video. Our Multi-Term Frame Encoder computes interdependencies across a hierarchy of temporal scales by sampling sequences at increasing strides around the frame of interest. Furthermore, we employ a…
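The idea of sampling sequences at increasing strides around the frame of interest can be illustrated with a minimal sketch. This is not taken from the official repository: the function name `multiscale_indices`, the window length, the stride values, and the clamping of out-of-range indices to the video boundaries are all assumptions made for illustration.

```python
def multiscale_indices(center, num_frames, window=8, strides=(1, 2, 4, 8)):
    """Return one list of frame indices per stride, each built around `center`.

    Hypothetical helper: `window`, `strides`, and boundary clamping are
    illustrative choices, not values confirmed by the paper.
    """
    if not 0 <= center < num_frames:
        raise ValueError("center must be a valid frame index")
    sequences = []
    for stride in strides:
        # Start half a window (measured in strides) before the frame of interest.
        start = center - (window // 2) * stride
        indices = [start + k * stride for k in range(window)]
        # Clamp indices that fall outside the video to the first/last frame.
        indices = [min(max(i, 0), num_frames - 1) for i in indices]
        sequences.append(indices)
    return sequences


if __name__ == "__main__":
    for seq in multiscale_indices(center=100, num_frames=1000, window=4, strides=(1, 2)):
        print(seq)
    # [98, 99, 100, 101]
    # [96, 98, 100, 102]
```

Every sequence has the same length, so a larger stride covers a longer temporal span at a coarser resolution while always including the frame of interest. How the resulting sequences are encoded and combined is left to the model and is not shown here.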

Code & Models

Repositories

BCV-Uniandes/MuST (PyTorch, official)

Taxonomy

Methods: Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Softmax · Attention Is All You Need · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Multi-Head Attention · Dense Connections