Event-Level Detection of Surgical Instrument Handovers in Videos with Interpretable Vision Models

Katerina Katsarou; George Zountsas; Karam Tomotaki-Dawoud; Alexander Ehrenhoefer; Paul Chojecki; David Przewozny; Igor Maximilian Sauer; Amira Mouakher; Sebastian Bosse

arXiv:2604.07577·cs.CV·April 10, 2026

TL;DR

This paper introduces a spatiotemporal vision model that combines a Vision Transformer backbone with an LSTM for interpretable detection and direction classification of surgical instrument handovers in videos, in support of operating room monitoring.

Contribution

It presents a multi-task framework that jointly detects handovers and classifies their direction in a single model, avoiding the error propagation typical of cascaded pipelines.

Findings

01. Achieved an F1-score of 0.84 for handover detection.
02. Obtained a mean F1-score of 0.72 for direction classification.
03. Outperformed baseline models in direction prediction while maintaining detection performance.
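
For readers unfamiliar with the metrics above, the following is a minimal sketch of how an F1-score and a mean (macro-averaged) F1-score can be computed. It assumes that "mean F1-score" refers to an unweighted average of per-class F1 values, which the listing does not confirm; the direction labels and sample predictions are hypothetical and are not data from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        scores.append(f1_score(tp, fp, fn))
    return sum(scores) / len(scores)


# Hypothetical direction labels, for illustration only.
y_true = ["to_surgeon", "to_surgeon", "from_surgeon", "from_surgeon"]
y_pred = ["to_surgeon", "from_surgeon", "from_surgeon", "from_surgeon"]
score = macro_f1(y_true, y_pred)
```

Macro averaging gives every direction class equal weight regardless of how often it occurs, which matters when one direction is much rarer than the other.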

Abstract

Reliable monitoring of surgical instrument exchanges is essential for maintaining procedural efficiency and patient safety in the operating room. Automatic detection of instrument handovers in intraoperative video remains challenging due to frequent occlusions, background clutter, and the temporally evolving nature of interaction events. We propose a spatiotemporal vision framework for event-level detection and direction classification of surgical instrument handovers in surgical videos. The model combines a Vision Transformer (ViT) backbone for spatial feature extraction with a unidirectional Long Short-Term Memory (LSTM) network for temporal aggregation. A unified multi-task formulation jointly predicts handover occurrence and interaction direction, enabling consistent modeling of transfer dynamics while avoiding error propagation typical of cascaded pipelines. Predicted confidence…
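
The architecture described in the abstract can be sketched roughly as follows in PyTorch. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name `HandoverNet`, the tiny ViT-style encoder standing in for the ViT backbone, all layer sizes, the number of direction classes, the per-frame outputs, and the equally weighted joint loss are hypothetical choices that the abstract does not specify.

```python
import torch
import torch.nn as nn


class HandoverNet(nn.Module):
    """Hypothetical sketch: ViT-style frame encoder, unidirectional LSTM, two task heads."""

    def __init__(self, image_size=32, patch_size=8, dim=64, num_directions=2):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution turns each frame into patch tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Unidirectional LSTM aggregates per-frame features over time.
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=False)
        # Two heads share the same temporal features (multi-task formulation).
        self.detect_head = nn.Linear(dim, 1)
        self.direction_head = nn.Linear(dim, num_directions)

    def forward(self, clip):
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        frames = clip.reshape(b * t, c, h, w)
        tokens = self.patch_embed(frames).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(b * t, -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        frame_feats = self.encoder(tokens)[:, 0]  # class token per frame
        seq, _ = self.lstm(frame_feats.reshape(b, t, -1))
        # Per-frame handover logits and per-frame direction logits.
        return self.detect_head(seq).squeeze(-1), self.direction_head(seq)


def joint_loss(det_logits, dir_logits, det_target, dir_target, weight=1.0):
    """Assumed joint objective: detection BCE plus weighted direction cross-entropy."""
    det = nn.functional.binary_cross_entropy_with_logits(det_logits, det_target)
    direction = nn.functional.cross_entropy(dir_logits.flatten(0, 1), dir_target.flatten())
    return det + weight * direction


model = HandoverNet()
clip = torch.randn(2, 5, 3, 32, 32)  # 2 clips of 5 random frames each
det_logits, dir_logits = model(clip)
loss = joint_loss(
    det_logits, dir_logits, torch.zeros(2, 5), torch.zeros(2, 5, dtype=torch.long)
)
```

Because both heads read the same LSTM output, the direction prediction does not depend on a separately thresholded detection stage, which is the property the abstract contrasts with cascaded pipelines.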
