Teach Multimodal LLMs to Comprehend Electrocardiographic Images
Ruoqi Liu, Yuelin Bai, Xiang Yue, Ping Zhang

TL;DR
This paper introduces ECGInstruct, a large instruction tuning dataset, and PULSE, a multimodal LLM for ECG image interpretation, achieving state-of-the-art accuracy and addressing limitations of existing methods.
Contribution
The paper presents ECGInstruct and PULSE, the first large-scale ECG image instruction tuning dataset and a specialized multimodal LLM for ECG interpretation.
Findings
PULSE outperforms general MLLMs by 15-30% in accuracy.
ECGInstruct enables effective instruction tuning for ECG image tasks.
ECGBench provides a comprehensive benchmark for ECG image interpretation.
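To make the accuracy comparison in the findings above concrete, here is a minimal sketch of how closed-ended answers might be scored and how a gap between two models could be reported. It assumes exact-match scoring and an absolute gap in percentage points; the summary does not say whether the 15-30% figure is absolute or relative. The function names and toy labels are hypothetical and are not taken from the paper or from ECGBench.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that match the reference label (case-insensitive)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def improvement_points(model_acc: float, baseline_acc: float) -> float:
    """Absolute accuracy gap, expressed in percentage points."""
    return 100.0 * (model_acc - baseline_acc)


# Toy, made-up multiple-choice labels for illustration only
# (these are NOT results from the paper).
references = ["A", "C", "B", "D", "A"]
baseline_preds = ["A", "B", "B", "A", "C"]  # 2 of 5 correct
tuned_preds = ["A", "C", "B", "A", "A"]     # 4 of 5 correct

baseline_acc = accuracy(baseline_preds, references)
tuned_acc = accuracy(tuned_preds, references)
gap = improvement_points(tuned_acc, baseline_acc)
```

Reporting the gap in percentage points rather than as a relative change keeps the comparison unambiguous when baseline accuracies differ widely across tasks.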
Abstract
The electrocardiogram (ECG) is an essential non-invasive diagnostic tool for assessing cardiac conditions. Existing automatic interpretation methods suffer from limited generalizability, focusing on a narrow range of cardiac conditions, and typically depend on raw physiological signals, which may not be readily available in resource-limited settings where only printed or digital ECG images are accessible. Recent advancements in multimodal large language models (MLLMs) present promising opportunities for addressing these challenges. However, the application of MLLMs to ECG image interpretation remains challenging due to the lack of instruction tuning datasets and well-established ECG image benchmarks for quantitative evaluation. To address these challenges, we introduce ECGInstruct, a comprehensive ECG image instruction tuning dataset of over one million samples, covering a wide range of…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The study’s strengths include the development of a large-scale dataset and the introduction of extensive benchmark tasks, both of which contribute positively to the field.
The study lacks strong technical novelty and sufficient technical detail. For a fairer comparison, it would be preferable to include results from fine-tuning other LLMs. Additionally, human accuracy should be evaluated, at least partially, for the ECGBench tasks. It is also recommended to assess performance on external validation tasks beyond the ECGBench dataset.
This work contributes a comprehensive ECG image instruction tuning dataset from diverse data sources, which facilitates the advancement of MLLMs in understanding ECG images. The proposed PULSE model surpassed proprietary and open-source MLLMs by a large margin on diverse tasks related to ECG image comprehension. The model, data and code have been released.
As stated in the discussion, the dataset contains few multistep instructions, which could undermine the multistep reasoning capability of the PULSE model and limit its report generation performance.
The paper's strengths include:
- Many SOTA MLLMs as baselines.
- Evaluation on a wide range of tasks.
- Preparation of a benchmark that could be used in future works. This may be relevant for the field.
- Preparation of an instruction-tuning dataset that could be used in future works. Again, this may be relevant for the field.
The weaknesses include:
- Motivation: while there may be cases where only printouts of ECGs are available, the need for AI models on ECG printouts is still questionable.
- Related works: Current multimodally trained ECG models are neither discussed nor compared against, e.g.: Radhakrishnan et al. (2023), "Cross-modal autoencoder framework learns holistic representations of cardiovascular state", https://www.nature.com/articles/s41467-023-38125-0, Turgut et al. (2023), "Unlocking the Diagnostic
- The first work to treat ECG signals as visual signals rather than time series.
- Builds the first benchmark for ECG instruction tuning with image input.
To the best of the reviewer's knowledge, this is the first work converting ECG signals into visual input, allowing ECG analysis to benefit from the advancements in the Visual Language Model (VLM) community. Additionally, the authors establish a comprehensive benchmark for evaluating ECG image instruction tuning, opening a new domain for ECG a
Even though this work is well-executed and evaluated on multiple datasets, it still has some weaknesses:
- **Reliance on Accuracy and LLM Scoring**: The evaluation metrics are based solely on accuracy or LLM scoring. Incorporating human evaluation could enhance benchmark quality, as current LLMs are not specialized for ECG-related text understanding. I suggest using human evaluation for all open-ended tasks.
- **ViT Model Limitations**: The Vision Transformer (ViT) model is frozen, and while L
Taxonomy
Topics: ECG Monitoring and Analysis · Advanced Computational Techniques and Applications · Oil and Gas Production Techniques
Methods: PULSE
