Bridging Vision and Language for Robust Context-Aware Surgical Point Tracking: The VL-SurgPT Dataset and Benchmark

Rulin Zhou; Wenlong He; An Wang; Jianhang Zhang; Xuanhui Zeng; Xi Zhang; Chaowei Zhu; Haijun Hu; Hongliang Ren

arXiv:2511.12026 · cs.CV · November 18, 2025

TL;DR

This paper introduces VL-SurgPT, a large-scale multimodal dataset combining visual and textual data for surgical point tracking, and proposes a text-guided tracking method that enhances robustness in challenging surgical environments.

Contribution

The paper presents the first multimodal surgical tracking dataset with semantic descriptions and introduces a novel text-guided tracking approach to improve robustness under adverse conditions.

Findings

1. Semantic descriptions improve tracking accuracy.
2. Text-guided approach outperforms vision-only methods.
3. Robustness increases in challenging visual scenarios.

Abstract

Accurate point tracking in surgical environments remains challenging due to complex visual conditions, including smoke occlusion, specular reflections, and tissue deformation. While existing surgical tracking datasets provide coordinate information, they lack the semantic context necessary to understand tracking failure mechanisms. We introduce VL-SurgPT, the first large-scale multimodal dataset that bridges visual tracking with textual descriptions of point status in surgical scenes. The dataset comprises 908 in vivo video clips, including 754 for tissue tracking (17,171 annotated points across five challenging scenarios) and 154 for instrument tracking (covering seven instrument types with detailed keypoint annotations). We establish comprehensive benchmarks using eight state-of-the-art tracking methods and propose TG-SurgPT, a text-guided tracking approach that leverages semantic…

Taxonomy

Topics: Robotics and Sensor-Based Localization · Surgical Simulation and Training · 3D Shape Modeling and Analysis