LoViT: Long Video Transformer for Surgical Phase Recognition

Yang Liu; Maxence Boels; Luis C. Garcia-Peraza-Herrera; Tom Vercauteren; Prokar Dasgupta; Alejandro Granados; Sebastien Ourselin

arXiv:2305.08989 · cs.CV · June 6, 2025 · 6 cites

TL;DR

LoViT is a novel transformer-based method for surgical phase recognition in long videos, effectively fusing local and global temporal features to improve accuracy over previous approaches.

Contribution

The paper introduces a two-stage Long Video Transformer (LoViT) that combines multi-scale temporal aggregation with phase-transition-aware supervision to improve surgical phase recognition.

Findings

1. Outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets.

2. Achieves significant improvements in accuracy and Jaccard index.

3. Effectively handles long surgical videos with complex temporal dependencies.
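
The two metrics named in the findings can be illustrated with a minimal sketch. It assumes each video is a sequence of integer phase labels, one per frame. The function names and the toy labels are hypothetical, and the published evaluation protocol may differ (for example, in how phase boundaries are treated or how scores are averaged across videos).

```python
from typing import Dict, Sequence


def frame_accuracy(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Fraction of frames whose predicted phase equals the ground truth."""
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)


def phase_jaccard(pred: Sequence[int], gt: Sequence[int]) -> Dict[int, float]:
    """Per-phase Jaccard index: |pred AND gt| / |pred OR gt| over frames.

    Only phases that occur in the ground truth are scored, so the union
    is never empty.
    """
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    scores: Dict[int, float] = {}
    for phase in sorted(set(gt)):
        inter = sum(p == phase and g == phase for p, g in zip(pred, gt))
        union = sum(p == phase or g == phase for p, g in zip(pred, gt))
        scores[phase] = inter / union
    return scores


# Toy example (made-up labels, not data from the paper).
gt = [0, 0, 0, 1, 1, 1, 2, 2]
pred = [0, 0, 1, 1, 1, 1, 2, 0]

acc = frame_accuracy(pred, gt)              # 6 of 8 frames correct
jac = phase_jaccard(pred, gt)               # one score per phase
mean_jaccard = sum(jac.values()) / len(jac)
print(acc, jac, mean_jaccard)
```

In this toy case the accuracy is 0.75 while the mean Jaccard is lower, because errors near phase boundaries penalize both the phase that lost frames and the phase that gained them.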

Abstract

Online surgical phase recognition plays a significant role towards building contextual tools that could quantify performance and oversee the execution of surgical workflows. Current approaches are limited since they train spatial feature extractors using frame-level supervision that could lead to incorrect predictions due to similar frames appearing at different phases, and poorly fuse local and global features due to computational constraints which can affect the analysis of long videos commonly encountered in surgical interventions. In this paper, we present a two-stage method, called Long Video Transformer (LoViT) for fusing short- and long-term temporal information that combines a temporally-rich spatial feature extractor and a multi-scale temporal aggregator consisting of two cascaded L-Trans modules based on self-attention, followed by a G-Informer module based on ProbSparse…
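
For readers unfamiliar with the term, ProbSparse self-attention comes from the Informer model (Zhou et al., 2021). The formulas below restate that original definition as background; they are not taken from the LoViT paper, and how LoViT configures the mechanism is not specified in the text above. Here q_i is the i-th query, k_j the j-th key, d the key dimension, L_Q and L_K the numbers of queries and keys, and c a constant sampling factor.

```latex
% ProbSparse self-attention as defined for Informer (Zhou et al., 2021).
% Background only; not a statement of how LoViT parameterizes it.

% Approximate sparsity measurement of query q_i against all keys K:
\bar{M}(q_i, K) = \max_{j}\left\{ \frac{q_i k_j^{\top}}{\sqrt{d}} \right\}
                - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}

% Attention is computed only for the u queries with the largest \bar{M},
% collected in the sparse query matrix \bar{Q}:
\mathcal{A}(Q, K, V) = \mathrm{Softmax}\!\left( \frac{\bar{Q} K^{\top}}{\sqrt{d}} \right) V,
\qquad u = c \cdot \ln L_Q
```

Because only on the order of ln L_Q queries receive full attention, the cost grows as O(L ln L) rather than O(L^2), which is what makes this kind of attention attractive for very long sequences.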

Code & Models

Repositories

MRUIL/LoViT (PyTorch, official)

Taxonomy

Topics: Surgical Simulation and Training

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Absolute Position Encodings · Multi-Head Attention · Adam · Softmax · Layer Normalization · Byte Pair Encoding