EchoVLM: Measurement-Grounded Multimodal Learning for Echocardiography
Yuheng Li, Yue Zhang, Abdoul Aziz Amadou, Yuxiang Lai, Jike Zhong, Tiziano Passerini, Dorin Comaniciu, Puneet Sharma

TL;DR
EchoVLM is a vision-language model for echocardiography, trained on a new measurement-grounded dataset with specialized pretraining objectives to improve clinical interpretation tasks.
Contribution
The paper presents EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, and EchoVLM, a vision-language model with novel pretraining objectives tailored for clinical echocardiography interpretation.
Findings
Achieved 86.5% AUC in zero-shot disease classification.
Achieved 95.1% accuracy in view classification.
Demonstrated state-of-the-art performance across five clinical tasks.
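The summary does not say how the zero-shot AUC was computed. As a rough sketch only, the common CLIP-style recipe scores each image by comparing its embedding against a "disease present" and a "disease absent" prompt embedding, then computes AUC over those scores. The function names, the two-prompt setup, and the absence of a temperature are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def zero_shot_scores(img_emb, pos_prompt_emb, neg_prompt_emb):
    """Score in (0, 1) that each image matches the positive prompt (hypothetical setup)."""
    def unit(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    img = unit(np.asarray(img_emb, dtype=float))
    # Cosine similarity of every image to the negative and positive prompt.
    sims = np.stack(
        [img @ unit(np.asarray(neg_prompt_emb, dtype=float)),
         img @ unit(np.asarray(pos_prompt_emb, dtype=float))],
        axis=1,
    )
    sims = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(sims) / np.exp(sims).sum(axis=1, keepdims=True)
    return probs[:, 1]

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
```

Because AUC depends only on how the scores are ranked, any monotone rescaling of the similarities (such as a temperature) would leave it unchanged.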
Abstract
Echocardiography is the most widely used imaging modality in cardiology, yet its interpretation remains labor-intensive and inherently multimodal, requiring view recognition, quantitative measurements, qualitative assessments, and guideline-based reasoning. While recent vision-language models (VLMs) have achieved broad success in natural images and certain medical domains, their potential in echocardiography has been limited by the lack of large-scale, clinically grounded image-text datasets and the absence of measurement-based reasoning central to echo interpretation. We introduce EchoGround-MIMIC, the first measurement-grounded multimodal echocardiography dataset, comprising 19,065 image-text pairs from 1,572 patients with standardized views, structured measurements, measurement-grounded captions, and guideline-derived disease labels. Building on this resource, we propose EchoVLM, a…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
-- Data: EchoGround-MIMIC (~20K measurement-grounded image-text pairs), the first open-source dataset of its kind for echocardiography.
-- Data Processing Innovation: Successfully integrates MIMIC-IV-ECHO imaging with MIMIC-IV-Note reports.
-- Clinical Relevance: Addresses critical gap between free-text narratives and quantitative measurements essential for guideline-based echo diagnosis.
-- Comprehensive Evaluation Framework: 36 tasks across 5 clinical application types (classification, retriev
-- Limited Technical Novelty: View-informed loss is just constrained negative sampling; negation-aware loss potentially similar to existing work (e.g., MICCAI 2025, "EchoViewCLIP: Advancing Video Quality Control through High-performance View Recognition of Echocardiography").
-- Evaluation Methodology Issues: Primary comparison against unreleased EchoApex (weights/data are not released, based on reported results) instead of available EchoPrime (weights are open to download) raises reproducibilit
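To make the "constrained negative sampling" characterization concrete, here is a minimal sketch of a contrastive (InfoNCE) loss in which an image may only be contrasted against captions from the same view. It is one possible reading of the reviewer's remark, not the paper's actual objective: the same-view constraint, the function name `masked_info_nce`, and the temperature value are all assumptions.

```python
import numpy as np

def masked_info_nce(img, txt, views, temperature=0.07):
    """Image-to-text InfoNCE where negatives are limited to same-view pairs (assumed rule)."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (N, N) scaled cosine similarities

    views = np.asarray(views)
    # Pair (i, j) is eligible only if both samples share a view label.
    # The diagonal (the true pair) is always eligible.
    allowed = views[:, None] == views[None, :]
    logits = np.where(allowed, logits, -np.inf)

    # Stable log-sum-exp over the eligible entries of each row.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    return float(np.mean(log_denom - np.diag(logits)))
```

With every sample in a different view, no negatives remain and the loss is zero; with all samples in one view, it reduces to the standard image-to-text InfoNCE term.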
- The authors propose a needed dataset for echocardiography. Most of the papers working on VLMs for echo are constrained to private datasets, limiting their applicability and contribution.
- The paper comprehensively details different procedures taken to obtain the final dataset from the raw original MIMIC-IV-ECHO.
- The negation-aware contrastive objective for CLIP, along with diverse ablation studies.
- The main weakness of the paper, to me, is its limited architectural novelty. Although the new dataset is needed by the community working on echocardiography, the proposed EchoVLM is similar to prior work, namely the original CLIP and its variant Echo-CLIP.
- Measurements are cropped from overlays and transcribed via an LLM, along with the captions and guideline labels. Although this is acknowledged in the paper and despite manual checks, parsing errors may introduce label noise as
1. A significant contribution of this work is the design of a comprehensive data processing pipeline. This pipeline successfully extracts and aligns a complex, multimodal dataset (comprising images, standardized views, quantitative measurements, measurement-related reports, and disease labels) from the MIMIC-IV-ECHO and MIMIC-IV-Note databases.
2. The paper proposes two novel and clinically-motivated pretraining objectives: view-informed contrastive learning and negation-aware contrastive learnin
1. Disconnect between "Measurement-Grounded" Narrative and Methodology: The paper's core theme is "measurement-grounded multimodal learning." However, there appears to be a significant disconnect between this narrative and the technical implementation. The structured measurements (e.g., JSON-formatted values like "EF: 45%"), which are a key highlight of the new dataset, are not directly utilized as an input during the model's training phase. The model is only trained on the captions derived from
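The reviewer's distinction can be illustrated with a small sketch: a structured measurement record is rendered into a caption, and only that caption string would reach the text encoder. The template, the function name `render_caption`, and the record layout are hypothetical; according to the reviews, the paper's captions are produced with an LLM rather than a fixed template.

```python
def render_caption(view, measurements):
    """Turn structured measurements into caption text (hypothetical template).

    `measurements` maps a name to a (value, unit) pair, e.g. {"EF": (45, "%")}.
    """
    facts = [f"{name} is {value:g}{unit}" for name, (value, unit) in measurements.items()]
    if not facts:
        return f"{view} view."
    return f"{view} view; " + ", ".join(facts) + "."


# The numeric value survives only as text tokens inside the caption.
caption = render_caption("Apical four-chamber", {"EF": (45, "%")})
print(caption)
```

Under this reading, the model never receives the value 45 as a number, only as characters in a sentence, which is the gap the reviewer points to.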
Taxonomy
Topics: COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
