# Surgical instrument-tissue interaction recognition with multi-task-attention video transformer

**Authors:** Lennart Maack, Berk Cam, Sarah Latus, Tobias Maurer, Alexander Schlaefer

PMC · DOI: 10.1007/s11548-025-03546-3 · International Journal of Computer Assisted Radiology and Surgery · 2025-11-11

## TL;DR

This paper shows that fine-grained temporal context improves surgical instrument-tissue interaction recognition with video transformers, and introduces a multi-task-attention module (MTAM) that improves communication between the instrument, action, and target subtasks.

## Contribution

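The novel multi-task-attention module (MTAM) combines cross-attention with a gating mechanism so that the subtasks of identifying the surgical instrument, atomic action, and anatomical target can exchange spatio-temporal features, which improves interaction recognition performance.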

## Key findings

- Using fine-grained temporal context improves interaction recognition compared to static images or coarse sampling.
- The proposed multi-task-attention module (MTAM) outperforms state-of-the-art multi-task video transformers on the CholecT45-Vid and GraSP-Vid datasets, with relative improvements of 4.8% and 5.9%, respectively.
- A sampling rate of 6-8 Hz was identified as optimal for the examined datasets.
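
To make the sampling-rate finding concrete, the following minimal sketch shows one way to pick frame indices so that a clip is read at roughly a target rate such as 6-8 Hz. The function name `subsample_indices`, the source frame rates used in the example, and the floor-based index selection are all assumptions for illustration; the summary does not state how the authors sample frames or what the native frame rates of the datasets are.

```python
def subsample_indices(num_frames: int, source_fps: float, target_hz: float) -> list:
    """Return indices of the frames to keep so the clip is sampled at ~target_hz.

    Hypothetical helper: not taken from the paper.
    """
    if source_fps <= 0 or target_hz <= 0:
        raise ValueError("frame rates must be positive")
    if target_hz > source_fps:
        raise ValueError("cannot sample faster than the source frame rate")

    # Number of source frames between two kept frames (may be fractional).
    step = source_fps / target_hz

    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))  # floor to the nearest earlier frame
        k += 1
    return indices


# One second of video at an assumed 24 fps, read at 8 Hz and at 6 Hz.
eight_hz = subsample_indices(num_frames=24, source_fps=24.0, target_hz=8.0)
six_hz = subsample_indices(num_frames=24, source_fps=24.0, target_hz=6.0)
print(eight_hz)
print(six_hz)
```

A coarser setting (for example 1 Hz) would keep only one frame per second under the same scheme, which is the kind of coarse temporal sampling the paper contrasts against.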

## Abstract

Recognizing surgical instrument-tissue interactions can enhance surgical workflow analysis, improve automated safety systems, and enable skill assessment in minimally invasive surgery. However, current deep learning methods for surgical instrument-tissue interaction recognition often rely on static images or coarse temporal sampling, which limits their ability to capture rapid surgical dynamics. Therefore, this study systematically investigates the impact of incorporating fine-grained temporal context into deep learning models for interaction recognition.

We conduct extensive experiments on multiple curated video-based datasets to investigate the influence of fine-grained temporal context on instrument-tissue interaction recognition, using video transformers with spatio-temporal feature extraction capabilities. Additionally, we propose a multi-task-attention module (MTAM) that utilizes cross-attention and a gating mechanism to improve communication between the subtasks of identifying the surgical instrument, atomic action, and anatomical target.
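
The combination of cross-attention and gating described above can be illustrated with a deliberately simplified sketch. It is not the authors' MTAM: the abstract does not specify the architecture, so everything here is an assumption, including the single attention head, the absence of learned query/key/value projections, the scalar sigmoid gate, the residual addition, and the names `cross_attention`, `gated_fusion`, `gate_weights`, and `gate_bias`. Tokens of one subtask (for example the atomic action) act as queries and attend to tokens of another subtask (for example the instrument).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    Simplification (assumption): no learned projections, and the context
    tokens serve as both keys and values.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in context])
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return attended


def gated_fusion(queries, context, gate_weights, gate_bias=0.0):
    """Add the cross-attended context to each query token, scaled by a gate.

    The gate is a sigmoid of a linear function of [query; attended], so
    `gate_weights` must have length 2 * dim. Hypothetical parameterization.
    """
    fused = []
    for q, a in zip(queries, cross_attention(queries, context)):
        gate = 1.0 / (1.0 + math.exp(-(dot(gate_weights, q + a) + gate_bias)))
        fused.append([qi + gate * ai for qi, ai in zip(q, a)])
    return fused


# Toy tokens (made up): two action tokens attend to three instrument tokens.
action_tokens = [[1.0, 0.0], [0.0, 1.0]]
instrument_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = gated_fusion(action_tokens, instrument_tokens, gate_weights=[0.5, 0.5, 0.5, 0.5])
print(fused)
```

With a strongly negative `gate_bias` the gate closes and each subtask keeps its own features nearly unchanged; with a strongly positive one the attended features are added in full. A learned gate sits between these extremes, letting the model decide how much information from the other subtask to admit.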

Our study demonstrates the benefit of utilizing the fine-grained temporal context for recognition of instrument-tissue interactions, with an optimal sampling rate of 6-8 Hz identified for the examined datasets. Furthermore, our proposed MTAM significantly outperforms state-of-the-art multi-task video transformer on the CholecT45-Vid and GraSP-Vid datasets, achieving relative increases of 4.8% and 5.9% in surgical instrument-tissue interaction recognition, respectively.
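
Because the reported gains are relative rather than absolute, a short worked example may help when reading them. The scores below are hypothetical and are not taken from the paper; they only show how a relative increase is computed from a baseline and an improved metric value.

```python
def relative_increase(baseline: float, improved: float) -> float:
    """Relative increase of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Hypothetical metric values (NOT from the paper), in percentage points.
baseline_score = 40.0
improved_score = 41.92

gain = relative_increase(baseline_score, improved_score)
print(f"absolute gain: {improved_score - baseline_score:.2f} points")
print(f"relative gain: {gain:.1f}%")
```

In this made-up case an absolute gain of 1.92 points corresponds to a relative increase of about 4.8%, so a relative figure should not be read as the same number of percentage points.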

In this work, we demonstrate the benefits of using a fine-grained temporal context rather than static images or coarse temporal context for the task of surgical instrument-tissue interaction recognition. We also show that leveraging cross-attention with spatio-temporal features from various subtasks leads to improved surgical instrument-tissue interaction recognition performance. The project is available at: https://lennart-maack.github.io/InstrTissRec-MTAM.

## Full-text entities

- **Diseases:** postoperative pain (MESH:D010149), MTAM (MESH:C566973)
- **Chemicals:** CholecT45 (-)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12929233/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12929233/full.md

## References

3 references — full list in the complete paper: https://tomesphere.com/paper/PMC12929233/full.md

---
Source: https://tomesphere.com/paper/PMC12929233