SurgPub-Video: A Comprehensive Surgical Video Dataset for Enhanced Surgical Intelligence in Vision-Language Model

Yaoqian Li; Xikai Yang; Dunyuan Xu; Yang Yu; Litao Zhao; Xiaowei Hu; Jinpeng Li; Pheng-Ann Heng

arXiv:2508.10054 · q-bio.OT · January 21, 2026

TL;DR

This paper introduces three things: SurgPub-Video, a large surgical video dataset; SurgLLaVA-Video, a specialized vision-language model; and a video-level surgical VQA benchmark. Together they aim to advance surgical scene understanding and analysis.

Contribution

The paper presents a comprehensive surgical video dataset, a specialized VLM, and a new benchmark, addressing limitations of existing datasets and models in surgical video analysis.

Findings

01

SurgLLaVA-Video outperforms general-purpose models in surgical tasks.

02

The dataset enables training of models with only three billion parameters.

03

Extensive experiments validate the effectiveness of the proposed approach.

Abstract

Vision-Language Models (VLMs) have shown significant potential in surgical scene analysis, yet existing models are limited by frame-level datasets and lack high-quality video data with procedural surgical knowledge. To address these challenges, we make the following contributions: (i) SurgPub-Video, a comprehensive dataset of over 3,000 surgical videos and 25 million annotated frames across 11 specialties, sourced from peer-reviewed clinical journals; (ii) SurgLLaVA-Video, a specialized VLM for surgical video understanding, built upon the TinyLLaVA-Video architecture that supports both video-level and frame-level inputs; and (iii) a video-level surgical Visual Question Answering (VQA) benchmark, covering 11 diverse surgical specialties, such as vascular, cardiology, and thoracic. Extensive experiments, conducted on the proposed benchmark and three additional surgical downstream tasks…
