LoViT: Long Video Transformer for Surgical Phase Recognition
Yang Liu, Maxence Boels, Luis C. Garcia-Peraza-Herrera, Tom Vercauteren, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin

TL;DR
LoViT is a novel transformer-based method for surgical phase recognition in long videos, effectively fusing local and global temporal features to improve accuracy over previous approaches.
Contribution
The paper introduces a two-stage Long Video Transformer (LoViT) that combines multi-scale temporal aggregation with phase transition-aware supervision for improved surgical phase recognition.
Findings
Outperforms state-of-the-art on Cholec80 and AutoLaparo datasets.
Achieves significant improvements in accuracy and Jaccard index.
Effectively handles long surgical videos with complex temporal dependencies.
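The two metrics named in the findings can be illustrated with a small sketch. It uses the plain textbook definitions (per-frame accuracy, and Jaccard index averaged over the phases present in the ground truth). The page does not state the paper's exact evaluation protocol, so the function names, the averaging choice, and the toy labels below are assumptions for illustration only.

```python
def frame_accuracy(pred, truth):
    """Fraction of frames whose predicted phase matches the ground truth."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("pred and truth must be non-empty and equally long")
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def mean_jaccard(pred, truth):
    """Jaccard index (intersection over union) per phase, averaged over
    the phases that occur in the ground truth (an assumed convention)."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("pred and truth must be non-empty and equally long")
    scores = []
    for phase in sorted(set(truth)):
        inter = sum(p == phase and t == phase for p, t in zip(pred, truth))
        union = sum(p == phase or t == phase for p, t in zip(pred, truth))
        scores.append(inter / union)  # union > 0 because phase occurs in truth
    return sum(scores) / len(scores)


# Hypothetical per-frame phase labels for one short video.
truth = [0, 0, 0, 1, 1, 1, 2, 2]
pred = [0, 0, 1, 1, 1, 1, 2, 1]

print(frame_accuracy(pred, truth))  # 6 of 8 frames correct
print(mean_jaccard(pred, truth))    # mean of 2/3, 3/5 and 1/2
```

Accuracy rewards getting the long phases right, while the Jaccard index penalizes errors on short phases more heavily, which is why the two are usually reported together.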
Abstract
Online surgical phase recognition plays a significant role in building contextual tools that could quantify performance and oversee the execution of surgical workflows. Current approaches are limited in two ways. First, they train spatial feature extractors with frame-level supervision, which can lead to incorrect predictions because similar frames appear in different phases. Second, they fuse local and global features poorly because of computational constraints, which can affect the analysis of the long videos commonly encountered in surgical interventions. In this paper, we present a two-stage method, called Long Video Transformer (LoViT), for fusing short- and long-term temporal information. It combines a temporally-rich spatial feature extractor and a multi-scale temporal aggregator consisting of two cascaded L-Trans modules based on self-attention, followed by a G-Informer module based on ProbSparse…
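The abstract says the temporal aggregator is built on self-attention and that recognition is online, so a prediction for a frame should not depend on later frames. The sketch below shows generic causal scaled dot-product self-attention over a sequence of frame feature vectors. It is not the L-Trans or G-Informer module: the function name is hypothetical, and it uses the raw features as queries, keys and values instead of learned projections, purely to keep the example short.

```python
import math


def causal_self_attention(features):
    """Generic causal self-attention over T feature vectors of size d.

    Assumption for illustration: queries, keys and values are the raw
    features themselves (no learned projection matrices).
    """
    d = len(features[0])
    scale = 1.0 / math.sqrt(d)
    outputs = []
    for t, query in enumerate(features):
        past = features[: t + 1]  # frame t attends only to frames 0..t
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in past]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, past)) for i in range(d)]
        )
    return outputs


# Hypothetical 2-dimensional features for four consecutive frames.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
fused = causal_self_attention(frames)
print(len(fused), len(fused[0]))
```

Because each output row mixes only the current and earlier frames, changing a later frame leaves earlier outputs untouched. The cost of full attention grows quadratically with sequence length, which is the kind of computational constraint the abstract mentions for long videos.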
Taxonomy
Topics: Surgical Simulation and Training
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Absolute Position Encodings · Multi-Head Attention · Adam · Softmax · Layer Normalization · Byte Pair Encoding
