Fine-grained Contrastive Learning for ECG-Report Alignment with Waveform Enhancement

Haitao Li; Che Liu; Zhengyao Ding; Ziyi Liu; Wenqi Shao; Zhengxing Huang

arXiv:2505.11939·eess.SP·September 30, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces FG-CLEP, a novel ECG-report alignment method that achieves fine-grained, tag-specific alignment by leveraging waveform enhancement and semantic similarity, significantly improving performance in zero-shot and linear probing tasks.

Contribution

The paper presents a new fine-grained contrastive learning approach for ECG-report alignment, incorporating waveform feature recovery with LLMs and a semantic similarity matrix to reduce false negatives.
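
The idea of using a semantic similarity matrix to reduce false negatives can be illustrated with a soft-target contrastive loss. The sketch below is a minimal illustration under assumptions, not the authors' implementation: the function name, the two temperatures, and the choice to turn tag similarities into targets with a softmax are all hypothetical. It shows how two ECGs whose sampled tags are semantically close stop being treated as pure negatives.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each embedding to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(z, axis=-1):
    # Numerically stable log-softmax.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def soft_target_contrastive_loss(ecg_emb, text_emb, tag_sim,
                                 temperature=0.07, target_temperature=0.1):
    """Symmetric contrastive loss with soft targets (illustrative only).

    ecg_emb, text_emb: (B, D) embeddings of paired ECGs and tags.
    tag_sim: (B, B) semantic similarity between the tags in the batch
             (assumed precomputed, e.g. from a text encoder).
    """
    ecg = l2_normalize(ecg_emb)
    txt = l2_normalize(text_emb)
    logits = ecg @ txt.T / temperature                      # (B, B)

    # Soft targets: each row is a distribution over the batch, so
    # semantically similar tags share probability mass with the true pair.
    tgt_e2t = np.exp(log_softmax(tag_sim / target_temperature, axis=1))
    tgt_t2e = np.exp(log_softmax(tag_sim.T / target_temperature, axis=1))

    loss_e2t = -(tgt_e2t * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2e = -(tgt_t2e * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return float(0.5 * (loss_e2t + loss_t2e))
```

With an identity `tag_sim` and a very small `target_temperature`, the targets become one-hot and the loss reduces to the standard symmetric InfoNCE objective, which makes the soft version easy to compare against a hard-label baseline.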

Findings

01

FG-CLEP outperforms state-of-the-art methods in zero-shot prediction.

02

The coarse-to-fine training improves waveform feature recovery.

03

Fine-grained reports enhance downstream task performance.

Abstract

Electrocardiograms (ECGs) are essential for diagnosing cardiovascular diseases. However, existing ECG-Report contrastive learning methods focus on whole-ECG and report alignment, missing the link between local ECG features and individual report tags. In this paper, we propose FG-CLEP (Fine-Grained Contrastive Language ECG Pre-training), which achieves fine-grained alignment between specific ECG segments and each tag in the report via tag-specific ECG representations. Furthermore, we found that nearly 55% of ECG reports in the MIMIC-ECG training dataset lack detailed waveform features, which hinders fine-grained alignment. To address this, we introduce a coarse-to-fine training process that leverages large language models (LLMs) to recover these missing waveform features and validate the LLM outputs using a coarse model. Additionally, fine-grained alignment at the tag level, rather than…
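
One way to picture a tag-specific ECG representation is as attention pooling over ECG patch embeddings, with the tag embedding acting as the query. The sketch below is a rough single-head illustration under assumptions: the abstract does not specify the architecture, and the function name, the projection matrices `w_q`, `w_k`, `w_v`, and the scaled dot-product form are hypothetical choices rather than FG-CLEP's actual design.

```python
import numpy as np

def tag_specific_representation(patch_emb, tag_emb, w_q, w_k, w_v):
    """Pool ECG patches into one vector conditioned on a report tag.

    patch_emb: (P, D) embeddings of P ECG patches (segments).
    tag_emb:   (D,)   embedding of a single report tag.
    w_q, w_k, w_v: (D, H) projection matrices (hypothetical parameters).
    Returns the pooled (H,) representation and the (P,) attention weights.
    """
    q = tag_emb @ w_q                        # (H,)   query from the tag
    k = patch_emb @ w_k                      # (P, H) keys from the patches
    v = patch_emb @ w_v                      # (P, H) values from the patches

    scores = k @ q / np.sqrt(q.shape[-1])    # (P,) scaled dot-product scores
    scores = scores - scores.max()           # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()        # attention over patches

    return weights @ v, weights
```

The returned weights indicate which patches contribute most to the representation for a given tag, so the same ECG yields a different vector for each tag that queries it.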

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The problem of ECG–language pretraining is important and has clear clinical relevance.
2. The proposed framework is well-motivated and technically sound; the LLM-enriched waveform features are novel.
3. Experiments across multiple tasks show competitive or superior performance.

Weaknesses

1. In the “Fine-Grained Contrastive Learning Objective,” it appears that a single tag is randomly sampled per ECG. If so, this could ignore information from other relevant segments and reduce feature richness. Please clarify the sampling procedure and consider reporting results with multi-tag aggregation or coverage-controlled sampling.
2. While the tag sampling strategy plausibly aligns better with zero-shot text prompts and could help zero-shot performance, the improvements on linear probing

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The paper presents FG-CLEP, an innovative framework that advances ECG–text alignment by introducing fine-grained contrastive learning between ECG patches and report tags. Its coarse-to-fine training pipeline, which integrates large language models (LLMs) for recovering missing waveform features, demonstrates strong methodological creativity and practical relevance. Extensive experiments across six datasets show consistent performance gains in both zero-shot and linear probing tasks, validating t

Weaknesses

The dependence on LLMs for generating waveform features may introduce bias or inconsistency, even with CLEP-based validation. The evaluation lacks human expert assessment of fine-grained alignment quality beyond AUC metrics, limiting interpretability claims. The training efficiency and computational cost of multi-stage fine-tuning and LLM querying are not clearly quantified, raising concerns about scalability in clinical deployment. The model’s reliance on tag-level alignment assumes structu

Reviewer 03 · Rating 2 · Confidence 5

Strengths

- Introduces a cross-attention patch-tag alignment and LLM-based fine-grained training pipeline for ECG-text pretraining.

Weaknesses

- This work has limited technical novelty. It depends on LLM usage rather than on core modeling. Also, the LLM-generated GF reports are clinically unverifiable and risk introducing noise, bias, and, in particular, label leakage in zero-shot experiments.
- Lacks comparisons with recent ECG-text modeling works.

Taxonomy

Topics: Speech and Audio Processing · ECG Monitoring and Analysis · Speech Recognition and Synthesis

Methods: ADaptive gradient method with the OPTimal convergence rate · Contrastive Learning