TL;DR
Fetal-Gauge is a comprehensive benchmark with over 42,000 images and 93,000 QA pairs designed to evaluate vision-language models in fetal ultrasound, exposing significant performance gaps and guiding future improvements.
Contribution
The paper introduces the first large-scale, standardized VLM benchmark for fetal ultrasound, enabling systematic evaluation and highlighting the limitations of current models.
Findings
The best-performing model achieves only 55% accuracy, below clinical standards.
Current VLMs struggle with fetal ultrasound tasks due to domain-specific challenges.
The benchmark reveals critical gaps and guides future model development.
Abstract
The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. This gap is primarily due to the modality's challenging nature, operator dependency, and the limited public availability of datasets. To address this gap, we present Fetal-Gauge, the first and largest visual question answering benchmark specifically designed to evaluate VLMs across various…