CLEF: Clinically-Guided Contrastive Learning for Electrocardiogram Foundation Models
Yuxuan Shu, Peter H. Charlton, Fahim Kawsar, Jussi Hernesniemi, Mohammad Malekzadeh

TL;DR
This paper introduces CLEF, a novel contrastive learning method that incorporates clinical metadata to improve ECG foundation models, achieving better diagnostic accuracy and robustness across multiple tasks and datasets.
Contribution
CLEF is the first to integrate clinical risk scores into contrastive learning for ECG models, enhancing their clinical relevance without requiring per-sample annotations.
Findings
CLEF outperforms existing self-supervised models in classification and regression tasks.
CLEF achieves performance comparable to supervised models while using only routinely collected metadata.
Pretrained CLEF models improve ECG analysis accuracy and scalability.
Abstract
The electrocardiogram (ECG) is a key diagnostic tool in cardiovascular health. Single-lead ECG recording is integrated into both clinical-grade and consumer wearables. While self-supervised pretraining of foundation models on unlabeled ECGs improves diagnostic performance, existing approaches do not incorporate domain knowledge from clinical metadata. We introduce a novel contrastive learning approach that utilizes an established clinical risk score to adaptively weight negative pairs: clinically-guided contrastive learning. It aligns the similarities of ECG embeddings with clinically meaningful differences between subjects, with an explicit mechanism to handle missing metadata. On 12-lead ECGs from 161K patients in the MIMIC-IV dataset, we pretrain single-lead ECG foundation models at three scales, collectively called CLEF, using only routinely collected metadata without requiring…
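This page does not spell out CLEF's loss. As a rough illustration only, the sketch below assumes a standard InfoNCE objective. In it, each negative pair's term is scaled by a weight derived from the absolute difference in a clinical risk score such as SCORE2, and any pair where either subject lacks the score falls back to an unscaled weight of 1. The positive pair is assumed to be two views of the same recording. The matrices `D` (risk dissimilarity) and `M` (missing-metadata mask) loosely echo the W = D⊙M notation raised in the reviews, but their exact definitions there are not given. The names `risk_weights`, `weighted_info_nce`, `alpha` and `tau`, the scaling of `D` to [0, 1], and the missing-data fallback are all hypothetical. This is not the authors' formulation, and it omits the distance-alignment term the reviews mention.

```python
import numpy as np


def risk_weights(risk, alpha=1.0):
    """Hypothetical negative-pair weights from a per-subject risk score.

    risk: shape (N,), with NaN where the metadata is missing (assumption).
    Returns an (N, N) matrix of weights >= 1.
    """
    risk = np.asarray(risk, dtype=float)
    observed = ~np.isnan(risk)
    # M[i, j] = 1 only when both subjects have an observed risk score
    M = np.outer(observed, observed).astype(float)
    filled = np.where(observed, risk, 0.0)
    # D[i, j] = |r_i - r_j| on observed pairs, zero elsewhere
    D = np.abs(filled[:, None] - filled[None, :]) * M
    if D.max() > 0:
        D = D / D.max()  # scale to [0, 1] within the batch (assumption)
    # Clinically dissimilar negatives get up to 1 + alpha;
    # pairs with missing metadata keep the plain InfoNCE weight of 1.
    return 1.0 + alpha * D


def weighted_info_nce(z1, z2, W, tau=0.1):
    """InfoNCE where negative pair (i, j) is scaled by W[i, j].

    z1, z2: (N, d) embeddings of two views; row i of each is a positive pair.
    """
    # L2-normalize so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / tau
    # Subtracting the row max leaves each row's ratio unchanged (stability)
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    off_diag = 1.0 - np.eye(len(z1))
    pos = np.diag(e)                          # positive-pair terms
    neg = (W * e * off_diag).sum(axis=1)      # weighted negative terms
    return float(np.mean(-np.log(pos / (pos + neg))))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z1 = rng.normal(size=(4, 8))
    z2 = z1 + 0.05 * rng.normal(size=(4, 8))
    risk = np.array([2.0, 7.5, np.nan, 12.0])  # subject 2 has no score
    W = risk_weights(risk, alpha=1.0)
    print(W)
    print(weighted_info_nce(z1, z2, W))
```

In this sketch, `alpha=0` makes every weight 1, so the loss reduces to ordinary InfoNCE. A larger `alpha` increases the contribution of clinically dissimilar negatives, so the model is pushed harder to separate their embeddings. Subjects with missing scores are still used as negatives rather than dropped from the batch.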
Peer Reviews
Submitted to ICLR 2026
The idea of weighting negative pairs using risk-score-derived dissimilarities is elegant and well motivated. It directly connects the latent geometry of the embeddings to real-world clinical semantics, improving both interpretability and utility. Strong empirical evaluation and broad benchmarking. CLEF achieves strong performance even when pretrained on 12-lead data but fine-tuned on single-lead (lead I/II) tasks, suggesting it can effectively bridge clinical and consumer-grade ECG domains, a valuable…
While SCORE2 is a standard cardiovascular risk score, it is not necessarily the optimal or most generalizable measure for all datasets or populations (especially non-European cohorts). Since metadata such as age, sex, and blood pressure are correlated with many downstream labels, including this information during pretraining might inadvertently leak label information. Although the weighting matrix W = D ⊙ M is central to the approach, the paper does not deeply explore sensitivity to α…
- **Problem orientation is concrete and clinically motivated.** The work targets label scarcity and weak cross-device/population generalization in single-lead ECG, injecting domain knowledge via routinely captured clinical metadata rather than dense manual labels.
- **Method is technically neat and conceptually coherent.** Clinical dissimilarity is converted into a training signal through risk-aware negative reweighting, paired with a distance-alignment term, while missing metadata are modeled…
- **Teacher–task mismatch (SCORE2 for immediate diagnostics).** SCORE2 supervises with a 10-year cardiovascular risk (prognosis) derived from demographics and vitals, yet the paper applies it to pretrain models for short-horizon diagnostic tasks (e.g., arrhythmia/rhythm/beat classification) without justifying why a long-term risk surrogate is an appropriate teacher for immediate signal decisions. A minimally necessary control is to compare against an *acute* risk or diagnosis-proximal target.
1. Comprehensive and rigorous experimentation: The study presents extensive evaluations and ablation studies across diverse datasets and clinical tasks, clearly supporting the robustness of the proposed approach.
2. Clarity and strong organization: The methodology and motivation are well articulated with intuitive explanations and clear visualizations, making the paper easy to follow even for non-domain experts.
1. Lack of representation-level validation: Although the authors claim that CLEF adjusts contrastive representations based on clinical risk, the paper lacks direct empirical evidence of how this modification affects the underlying representation geometry. No analyses such as t-SNE or UMAP visualizations, pairwise similarity distributions, or intra-/inter-cluster distance comparisons are provided. All evaluations focus solely on downstream performance metrics, leaving the clinically guided…
Taxonomy
Topics: ECG Monitoring and Analysis · Atrial Fibrillation Management and Outcomes · Cardiac electrophysiology and arrhythmias
