U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding
Anjie Le, Henan Liu, Yue Wang, Zhenyu Liu, Rongkun Zhu, Taohan Weng, Jinze Yu, Boyang Wang, Yalun Wu, Kaiwen Yan, Quanlin Sun, Meirui Jiang, Jialun Pei, Siya Liu, Haoyun Zheng, Zhoujun Li, Alison Noble, Jacques Souquet, Xiaoqing Guo, Manxi Lin, Hongcheng Guo

TL;DR
U2-BENCH is a comprehensive benchmark designed to evaluate large vision-language models on ultrasound understanding, covering diverse tasks and scenarios to identify strengths and challenges in medical ultrasound interpretation.
Contribution
This work introduces U2-BENCH, the first extensive benchmark for assessing LVLMs on ultrasound tasks, including 7,241 cases across 15 anatomical regions and 8 clinical tasks.
Findings
Strong performance in image classification
Challenges in spatial reasoning and language generation
Benchmark facilitates targeted improvements in LVLMs for ultrasound
Abstract
Ultrasound is a widely used imaging modality critical to global healthcare, yet its interpretation remains challenging because image quality varies with the operator, noise, and anatomical structure. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 20 state-of-the-art LVLMs, both open- and closed-source,…
Peer Reviews
Decision·ICLR 2026 Poster
- This paper presents a well-motivated and comprehensive benchmark targeting an underexplored domain, ultrasound imaging. Its breadth (15 anatomies, 8 tasks) and evaluation rigor (20 models, standardized prompts) represent a good contribution.
- Twenty SOTA LVLMs are evaluated and compared, which makes the benchmark comprehensive. This benchmark would be beneficial for the medical multimodal community.
- While I appreciate the substantial effort invested in constructing this benchmark and the meticulous annotation process, the paper currently lacks sufficient conceptual or analytical insights for the research community. As noted in Lines 132–134, prior work such as GMAI-MMBench has already included ultrasound-related evaluation scenarios. Although U2-BENCH expands the dataset scale and task diversity, the incremental novelty over existing benchmarks appears marginal, focusing primarily on scop
The benchmark targets ultrasound, a clinically crucial yet under‑evaluated modality for LVLMs, with a broad task suite aligned to typical sonography workflows and clear task definitions and prompts per scenario. The evaluation spans 20 modern LVLMs with standardized prompt formats and metrics, including most popular and current SOTA models, open- as well as closed-source. The dataset curation aggregates many sources and applies multi‑stage QA with automated filtering plus manual review, and th
The composite U2‑Score’s task weights are proportional to sample counts, which conflates data availability with clinical importance and mixes heterogeneous metrics into a single scalar without uncertainty quantification, making ranking sensitivity high and potentially misaligned with clinician priorities. No uncertainty metrics (e.g., confidence intervals, paired tests, bootstrap CIs) are reported for main tables, so small deltas across 20 models and many tasks may reflect sampling noise or pro
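The weighting concern raised above can be illustrated with a small sketch. This is not the paper's definition of the U2-Score: the task names, scores, and sample counts below are hypothetical, every per-task metric is assumed to be already normalized to [0, 1], and the percentile bootstrap is just one common way to attach an interval to a mean score.

```python
import random

def composite_score(task_scores, task_counts):
    """Weighted mean of per-task scores, with weights proportional to sample counts."""
    total = sum(task_counts.values())
    return sum(task_scores[t] * task_counts[t] / total for t in task_scores)

def bootstrap_ci(per_sample_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-sample scores."""
    rng = random.Random(seed)
    n = len(per_sample_scores)
    means = []
    for _ in range(n_boot):
        # Resample n items with replacement and record the mean.
        resample = [per_sample_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Hypothetical numbers: a large, easy task dominates a small, hard one.
scores = {"diagnosis": 0.6, "report_generation": 0.2}
counts = {"diagnosis": 900, "report_generation": 100}
equal = {"diagnosis": 1, "report_generation": 1}

count_weighted = composite_score(scores, counts)  # driven mostly by "diagnosis"
equal_weighted = composite_score(scores, equal)   # both tasks count the same

# Hypothetical per-sample correctness (1 = correct) for one model on one task.
samples = [1] * 60 + [0] * 40
ci_low, ci_high = bootstrap_ci(samples)
```

With these made-up inputs the count-weighted score sits well above the equal-weighted one, which is the sense in which data availability can stand in for clinical importance. The interval gives a rough idea of how large a gap between two models needs to be before it is distinguishable from sampling noise.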
- (Quality) Application of a large range of existing VLMs to the benchmark
- (Clarity) Empirical justification for weighting of tasks
- (Significance) Extensive coverage of ultrasound understanding datasets, and unification into a single comprehensive benchmark
- Segmentation task underrepresented due to unification of ground truth to bounding boxes and predefined spatial localization
- Preprocessing of video data may limit analysis potential
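The first weakness above can be made concrete with a minimal sketch, assuming segmentation masks are reduced to axis-aligned bounding boxes. The helper `mask_to_bbox` and the toy masks are hypothetical and are not taken from the benchmark's preprocessing code.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) of the nonzero pixels, or None if the mask is empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))

# Two differently shaped hypothetical lesions.
l_shaped = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
filled = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
]

bbox_l = mask_to_bbox(l_shaped)
bbox_filled = mask_to_bbox(filled)
```

Both masks map to the same box, so a model evaluated only against boxes is never asked to distinguish the two shapes.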
