TL;DR
This paper introduces the SurgSTU-Pipeline, which generates surgical video datasets with fine-grained spatial-temporal annotations. Fine-tuning vision-language models on the resulting SurgSTU dataset improves their spatial-temporal understanding.
Contribution
The authors present a deterministic pipeline for creating large-scale surgical datasets with fine-grained spatial-temporal annotations, filling a key gap in surgical video understanding.
Findings
State-of-the-art VLMs struggle with spatial-temporal tasks in the zero-shot setting.
Fine-tuning VLMs on SurgSTU improves spatial-temporal understanding.
The SurgSTU dataset contains 7,515 clips with 150k QA samples.
Abstract
Surgical video understanding is a crucial prerequisite for advancing Computer-Assisted Surgery. While vision-language models (VLMs) have recently been applied to the surgical domain, existing surgical vision-language datasets fall short in capturing and evaluating complex, interleaved spatial-temporal dynamics. Creating large-scale datasets that accurately represent fine-grained spatial-temporal relationships in surgical videos is challenging due to costly manual annotations or error-prone generation using large language models. To address this gap, we introduce the SurgSTU-Pipeline, a deterministic generation pipeline featuring temporal and spatial continuity filtering to reliably create surgical datasets for fine-grained spatial-temporal multimodal understanding. Applying this pipeline to publicly available surgical datasets, we create the SurgSTU dataset, comprising 7,515 video clips…
