An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models

Lennart Maack; Alexander Schlaefer

arXiv:2604.00784 · cs.CV · April 2, 2026

TL;DR

This paper introduces the SurgSTU-Pipeline, a deterministic pipeline for generating surgical video datasets with fine-grained spatial-temporal annotations, and shows that fine-tuning vision-language models on the resulting SurgSTU dataset improves their spatial-temporal understanding.

Contribution

The authors present a deterministic pipeline for creating large-scale surgical datasets with fine-grained spatial-temporal annotations, addressing the lack of surgical vision-language datasets that capture complex, interleaved spatial-temporal dynamics.

Findings

01. State-of-the-art VLMs struggle in zero-shot spatial-temporal tasks.

02. Fine-tuning VLMs on SurgSTU improves spatial-temporal understanding.

03. The SurgSTU dataset contains 7,515 clips with 150k QA samples.

Abstract

Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets fall short in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large-scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging because manual annotation is costly and generation with large language models is error-prone. To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 7515 video clips…
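
The abstract names temporal and spatial continuity filtering but does not describe how it is implemented. The following is only a minimal sketch of what such a filter could look like, not the authors' method: the `Detection` record, the function names, and the thresholds `max_gap`, `max_jump`, and `min_length` are all hypothetical and chosen for illustration. It assumes each annotated object comes as a per-frame track of normalised box centres and keeps a track only if it has no missing frames beyond a tolerance and no implausibly large position jump between consecutive detections.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One annotated object position in one frame (hypothetical record)."""
    frame: int   # frame index within the clip
    cx: float    # box centre x, normalised to [0, 1]
    cy: float    # box centre y, normalised to [0, 1]


def is_continuous(track, max_gap=1, max_jump=0.2):
    """Return True if the track has no temporal gap larger than max_gap
    frames and no centre displacement larger than max_jump between
    consecutive detections. Both thresholds are illustrative assumptions."""
    ordered = sorted(track, key=lambda d: d.frame)
    for prev, cur in zip(ordered, ordered[1:]):
        # Temporal continuity: consecutive detections must be close in time.
        if cur.frame - prev.frame > max_gap:
            return False
        # Spatial continuity: the object must not jump across the image.
        jump = ((cur.cx - prev.cx) ** 2 + (cur.cy - prev.cy) ** 2) ** 0.5
        if jump > max_jump:
            return False
    return True


def filter_tracks(tracks, min_length=3, max_gap=1, max_jump=0.2):
    """Keep only tracks that are long enough and pass both continuity checks."""
    return {
        name: track
        for name, track in tracks.items()
        if len(track) >= min_length and is_continuous(track, max_gap, max_jump)
    }


# Toy example: only the first track is continuous in both time and space.
example_tracks = {
    "grasper": [Detection(0, 0.10, 0.10), Detection(1, 0.12, 0.10), Detection(2, 0.15, 0.12)],
    "hook": [Detection(0, 0.50, 0.50), Detection(3, 0.50, 0.50), Detection(4, 0.50, 0.50)],
    "clipper": [Detection(0, 0.10, 0.10), Detection(1, 0.90, 0.90), Detection(2, 0.90, 0.90)],
}
kept = filter_tracks(example_tracks)
```

Because the filter is a pure function of the annotations and fixed thresholds, the same input always yields the same output, which is the sense in which a rule-based step like this would be deterministic, in contrast to generation with a large language model.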
