SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

Xiyang Huang; Jiawei Lin; Keying Wu; Jiaxin Huang; Kailai Yang; Renxiong Wei; Cheng Zeng; Jiayi Xiang; Ziyan Kuang; Min Peng; Qianqian Xie; Sophia Ananiadou

arXiv:2604.09037·cs.CV·April 13, 2026

TL;DR

SiMing-Bench is a new benchmark for evaluating how well multimodal models understand procedural correctness in clinical videos by tracking ongoing interactions and state updates.

Contribution

The paper introduces SiMing-Bench, a benchmark, and SiMing-Score, a physician-annotated dataset, for assessing whether models can judge procedural correctness from continuous interactions in clinical skill videos.

Findings

1. Models show weak agreement with physician judgments.
2. Global assessments overestimate models' procedural understanding.
3. The performance bottleneck is modeling interaction-driven state updates.

Abstract

Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert…
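
To make the idea of interaction-driven state updates concrete, the following is a minimal illustrative sketch, not the benchmark's actual rubric or scoring method. The step names, the `TOY_RUBRIC` table, and the `judge_steps` function are all hypothetical, loosely modeled on an automated external defibrillator workflow. It shows why recognizing every event is not enough: whether an action is correct depends on the state left behind by earlier actions.

```python
# Hypothetical toy rubric (an assumption for illustration only): each step
# names the state facts it requires and the state facts it adds once performed.
TOY_RUBRIC = {
    "power_on_aed": {"requires": set(), "adds": {"aed_on"}},
    "attach_pads": {"requires": {"aed_on"}, "adds": {"pads_attached"}},
    "clear_patient": {"requires": {"pads_attached"}, "adds": {"patient_clear"}},
    "deliver_shock": {
        "requires": {"pads_attached", "patient_clear"},
        "adds": {"shock_delivered"},
    },
}


def judge_steps(actions):
    """Return (action, is_correct) pairs by tracking state across the sequence."""
    state = set()
    verdicts = []
    for action in actions:
        step = TOY_RUBRIC[action]
        # An action is judged correct only if every precondition already holds.
        is_correct = step["requires"] <= state
        verdicts.append((action, is_correct))
        # The action still happened, so its effects update the state either way.
        state |= step["adds"]
    return verdicts


# All four events occur, but the shock is delivered before the patient is cleared.
observed = ["power_on_aed", "attach_pads", "deliver_shock", "clear_patient"]
for action, is_correct in judge_steps(observed):
    print(action, "correct" if is_correct else "incorrect")
```

In this toy sequence an event-recognition view would see all four expected actions, while the step-wise judgment flags `deliver_shock` because the state at that point lacks `patient_clear`.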
