How Far Are Surgeons from Surgical World Models? A Pilot Study on Zero-shot Surgical Video Generation with Expert Assessment

Zhen Chen; Qing Xu; Jinlin Wu; Biao Yang; Yuhao Zhai; Geng Guo; Jing Zhang; Yinlu Ding; Nassir Navab; Jiebo Luo

arXiv:2511.01775 · cs.CV · November 4, 2025

TL;DR

This study evaluates the ability of foundation video generation models to produce realistic surgical videos, revealing a gap between visual plausibility and understanding of surgical procedures, and introduces benchmarks for future development.

Contribution

We introduce SurgVeo, a surgical video generation benchmark, and the Surgical Plausibility Pyramid, a framework to assess model outputs from appearance to surgical strategy.

Findings

01 Veo-3 achieves high visual plausibility in generated videos.

02 Models struggle with higher-level surgical reasoning and causal understanding.

03 The study highlights a significant gap between visual realism and surgical plausibility.

Abstract

Foundation models in video generation are demonstrating remarkable capabilities as potential world models for simulating the physical world. However, their application in high-stakes domains like surgery, which demand deep, specialized causal knowledge rather than general physical rules, remains a critical unexplored gap. To systematically address this challenge, we present SurgVeo, the first expert-curated benchmark for video generation model evaluation in surgery, and the Surgical Plausibility Pyramid (SPP), a novel, four-tiered framework tailored to assess model outputs from basic appearance to complex surgical strategy. On the basis of the SurgVeo benchmark, we task the advanced Veo-3 model with a zero-shot prediction task on surgical clips from laparoscopic and neurosurgical procedures. A panel of four board-certified surgeons evaluates the generated videos according to the SPP.…

Taxonomy

Topics: Surgical Simulation and Training · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications