# Signal or noise? Evaluating commonly used attribution methods for explaining deep neural networks in electrocardiogram classification

**Authors:** Bauke K O Arends, Wouter A C van Amsterdam, Pim van der Harst, Maarten van Smeden, René van Es, Rutger R van de Leur

PMC · DOI: 10.1093/ehjdh/ztag038 · European Heart Journal. Digital Health · 2026-03-10

## TL;DR

This study evaluates 12 attribution methods used to explain deep learning models in ECG classification, finding them unreliable and inconsistent for clinical use.

## Contribution

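The study evaluates 12 attribution methods for DNN-based ECG classification across four experiments (inter-method similarity, self-consistency, dependence on model weights, and ability to identify features important for model inference), revealing their instability and limited reliability.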

## Key findings

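- Explanations from different attribution methods correlated poorly with one another and varied widely across inter-method comparisons.
- Self-consistency across random model initializations was only moderate for most methods (mean correlation 0.41–0.65).
- Randomizing model weights led to a rapid loss of correlation, but for some methods the correlation did not converge to zero, indicating incomplete dependence on learned parameters.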

## Abstract

Attribution-based explainability methods are widely used in electrocardiogram (ECG) analysis to interpret predictions from ‘black-box’ deep neural networks (DNNs). To be useful in clinical applications, attribution methods must produce explanations that are both clear and reflective of the model’s inner workings. This study evaluates 12 attribution methods in DNN-based ECG classification.

We analysed 12 attribution methods using a dataset of 873 710 median beat ECGs spanning nine diagnostic classes. Methods were applied to convolutional neural network-based models trained for ECG classification. Performance was evaluated across four experiments: inter-method similarity, self-consistency, dependence on model weights, and ability to identify features important for model inference. All task models achieved an area under the receiver operating characteristic curve above 0.95. Attribution methods demonstrated low correlation and high variability across inter-method comparisons. Self-consistency across random model initializations was moderate for most methods (mean correlation 0.41–0.65). Randomizing model weights led to rapid loss of correlation, although some methods did not converge to zero. Perturbation of input data revealed differences in how well attribution methods identified features relevant to model performance.
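The inter-method similarity experiment can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes each method yields one attribution map per ECG with shape `(n_leads, n_samples)` and that similarity is measured with Pearson correlation on the flattened maps. The abstract only says "correlation", so the metric, the array shapes, the method names and the function `pairwise_attribution_correlation` are all hypothetical.

```python
import numpy as np

def pairwise_attribution_correlation(attributions):
    """Pearson correlation between every pair of flattened attribution maps.

    attributions: dict mapping a method name to an array of shape
    (n_leads, n_samples). Names and shapes are illustrative assumptions.
    """
    names = list(attributions)
    # One row per method, one column per (lead, sample) position.
    flat = np.stack(
        [np.asarray(attributions[name], dtype=float).ravel() for name in names]
    )
    # np.corrcoef treats each row as a variable and returns a
    # (n_methods, n_methods) correlation matrix.
    return names, np.corrcoef(flat)

# Synthetic stand-ins for attribution maps (not real ECG attributions).
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 600))
maps = {
    "method_a": base,
    "method_b": base + 0.5 * rng.normal(size=base.shape),  # noisy copy of method_a
    "method_c": rng.normal(size=base.shape),               # unrelated map
}
names, corr = pairwise_attribution_correlation(maps)
for name, row in zip(names, corr):
    print(name, np.round(row, 2))
```

In this toy setup the noisy copy correlates strongly with `method_a`, while the unrelated map does not. Low off-diagonal values across real methods would correspond to the low inter-method agreement reported above.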

Attribution methods demonstrated limited reliability, instability across model variants and incomplete dependence on learned parameters, constraining their utility in high-stakes settings such as healthcare. These findings suggest that attribution techniques should be used cautiously and supported by task-specific sanity checks. Approaches grounded in rigorous validation, inherently interpretable modelling or counterfactual explanations may better support clinically meaningful insight.
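One kind of sanity check is a weight-randomization test: if an attribution truly depends on what the model has learned, it should stop resembling the original once the weights are replaced by random values. The sketch below illustrates the idea on a toy two-layer network with an analytically computed input gradient. The network, the choice of plain gradient saliency and the names `saliency` and `randomization_check` are assumptions made for illustration, not the models or methods used in the study.

```python
import numpy as np

def saliency(W1, w2, x):
    """Gradient of f(x) = w2 . tanh(W1 @ x) with respect to the input x."""
    h = np.tanh(W1 @ x)
    return W1.T @ (w2 * (1.0 - h ** 2))

def randomization_check(W1, w2, x, rng):
    """Correlate the original saliency map with maps from randomized weights."""
    reference = saliency(W1, w2, x)
    # Fresh random weights drawn with the same scale as the toy model below.
    W1_rand = rng.normal(size=W1.shape) / np.sqrt(W1.shape[1])
    w2_rand = rng.normal(size=w2.shape)
    variants = {
        "original": (W1, w2),
        "output_layer_randomized": (W1, w2_rand),
        "all_layers_randomized": (W1_rand, w2_rand),
    }
    return {
        name: float(np.corrcoef(reference, saliency(a, b, x))[0, 1])
        for name, (a, b) in variants.items()
    }

# Toy "trained" model and input; all values are synthetic.
rng = np.random.default_rng(0)
n_inputs, n_hidden = 1000, 32
W1 = rng.normal(size=(n_hidden, n_inputs)) / np.sqrt(n_inputs)
w2 = rng.normal(size=n_hidden)
x = rng.normal(size=n_inputs)

result = randomization_check(W1, w2, x, rng)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

A method that passes this check shows correlation near zero once all layers are randomized. A method whose correlation stays clearly above zero is reflecting something other than the learned parameters, which is the failure mode the abstract describes for some methods.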

## Full-text entities

- **Diseases:** sinus tachycardia (MESH:D013616), left ventricular hypertrophy (MESH:D017379), sinus bradycardia (MESH:D012804), atrial fibrillation (MESH:D001281), AV block (MESH:D054537), ischaemia (MESH:D007511), cardiovascular disease (MESH:D002318), left bundle branch block (MESH:D002037)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12980500/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12980500/full.md

## References

32 references — full list in the complete paper: https://tomesphere.com/paper/PMC12980500/full.md

---
Source: https://tomesphere.com/paper/PMC12980500