ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding

Xucheng Wang; Xiaoman Zhang; Sung Eun Kim; Ankit Pal; Pranav Rajpurkar

arXiv:2604.10916·cs.CV·April 20, 2026

TL;DR

ReXSonoVQA is a new video question-answering benchmark designed to evaluate vision-language models' understanding of ultrasound procedures, highlighting current limitations in causal reasoning and procedural comprehension.

Contribution

The paper introduces ReXSonoVQA, a video QA benchmark for ultrasound procedures, and evaluates four existing VLMs zero-shot (Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro), revealing gaps in causal reasoning and procedural understanding.

Findings

01 VLMs can extract some procedural information from ultrasound videos.

02 Troubleshooting questions remain challenging for current models.

03 On troubleshooting questions, video input yields minimal gains over text-only baselines, exposing limitations in causal reasoning.

Abstract

Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.
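The comparison against text-only baselines described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical item layout (`McqItem`) and a hypothetical model interface (a callable that returns the index of its chosen option); neither is taken from the paper, and only the multiple-choice portion is covered, since scoring free-response answers would need a separate grader. The text-only baseline is simulated by withholding the video.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class McqItem:
    """One multiple-choice item. All field names are illustrative assumptions."""
    video_path: str
    question: str
    options: Sequence[str]
    answer_index: int
    competency: str  # e.g. "Action-Goal Reasoning"


# Hypothetical model interface: (video path or None, question, options) -> chosen index.
# Passing None for the video stands in for the text-only baseline.
Model = Callable[[Optional[str], str, Sequence[str]], int]


def mcq_accuracy(model: Model, items: List[McqItem], use_video: bool) -> float:
    """Fraction of items answered correctly, with or without the video input."""
    if not items:
        raise ValueError("items must be non-empty")
    correct = 0
    for item in items:
        video = item.video_path if use_video else None
        if model(video, item.question, item.options) == item.answer_index:
            correct += 1
    return correct / len(items)


def video_gain(model: Model, items: List[McqItem]) -> float:
    """Accuracy with video minus text-only accuracy for the same model and items."""
    with_video = mcq_accuracy(model, items, use_video=True)
    text_only = mcq_accuracy(model, items, use_video=False)
    return with_video - text_only
```

A gain near zero on a subset of items (for example, those tagged with the Artifact Resolution & Optimization competency) would suggest the model is answering mostly from the question text rather than from what happens in the clip.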
