ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation

Jungwoo Oh; Hyunseung Chung; Junhee Lee; Min-Gyu Kim; Hangyul Yoon; Ki Seong Lee; Youngchae Lee; Muhan Yeo; Edward Choi

arXiv:2603.14326 · cs.LG · March 17, 2026

TL;DR

This paper introduces ECG-Reasoning-Benchmark, a comprehensive evaluation framework revealing that current multimodal models lack genuine step-by-step reasoning in ECG interpretation, often relying on superficial cues instead of true visual understanding.

Contribution

The paper presents a new benchmark with over 6,400 samples to systematically assess reasoning in ECG diagnosis, exposing significant gaps in current models' logical deduction capabilities.

Findings

01

Models can retrieve the clinical criteria for a diagnosis but maintain a complete reasoning chain in only 6% of cases (Completion).

02

Models fail to ground ECG findings to visual evidence.

03

Current training paradigms do not foster true visual reasoning in ECG interpretation.

Abstract

While Multimodal Large Language Models (MLLMs) show promising performance in automated electrocardiogram interpretation, it remains unclear whether they genuinely perform step-by-step reasoning or merely rely on superficial visual cues. To investigate this, we introduce ECG-Reasoning-Benchmark, a novel multi-turn evaluation framework comprising over 6,400 samples to systematically assess step-by-step reasoning across 17 core ECG diagnoses. Our comprehensive evaluation of state-of-the-art models reveals a critical failure in executing multi-step logical deduction. Although models possess the medical knowledge to retrieve clinical criteria for a diagnosis, they exhibit near-zero success rates (6% Completion) in maintaining a complete reasoning chain, primarily failing to ground the corresponding ECG findings to the actual visual evidence in the ECG signal. These results…
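
The abstract reports a Completion rate but does not define it here. A minimal sketch of one plausible reading, assuming a sample counts as completed only when every step of its multi-turn reasoning chain is judged correct. The function name `completion_rate` and the boolean-per-step input format are hypothetical and are not taken from the paper or its dataset.

```python
from typing import Sequence


def completion_rate(chains: Sequence[Sequence[bool]]) -> float:
    """Fraction of samples whose reasoning chain is correct at every step.

    ASSUMPTION: each sample is a sequence of per-step correctness flags,
    and "completion" means all flags are True. The paper may define the
    metric differently.
    """
    if not chains:
        raise ValueError("need at least one sample")
    # A sample with no recorded steps is treated as not completed.
    completed = sum(1 for steps in chains if steps and all(steps))
    return completed / len(chains)


# Illustrative, made-up inputs (not benchmark data).
example_chains = [
    [True, True],         # every step correct: completed
    [True, False],        # fails at the second step
    [False],              # fails at the first step
    [True, True, True],   # every step correct: completed
]
example_rate = completion_rate(example_chains)
```

Under this reading, a model that answers most individual steps correctly can still score low, because a single failed step anywhere in the chain marks the whole sample as incomplete.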

Code & Models

Datasets

Jwoo5/ECG-Reasoning-Benchmark
dataset · 29 downloads

Taxonomy

Topics: ECG Monitoring and Analysis · Explainable Artificial Intelligence (XAI) · Machine Learning in Healthcare