HEARTS: Benchmarking LLM Reasoning on Health Time Series

Sirui Li; Shuhan Xiao; Mihir Joshi; Ahmed Metwally; Daniel McDuff; Wei Wang; Yuzhe Yang

arXiv:2603.06638 · cs.LG · March 17, 2026

TL;DR

HEARTS is a benchmark that evaluates large language models' reasoning abilities across diverse health time series data and shows that they still fall short of specialized models.

Contribution

The paper introduces HEARTS, a unified benchmark of 16 datasets and 110 tasks for assessing LLM reasoning over health time series. It targets a gap in existing benchmarks, which cover only a small set of health time series modalities and tasks.

Findings

1. LLMs underperform compared to specialized models
2. Performance is weakly correlated with general reasoning scores
3. LLMs struggle with multi-step temporal reasoning and scale poorly with complexity

Abstract

The rise of large language models (LLMs) has shifted time series analysis from narrow analytics to general-purpose reasoning. Yet, existing benchmarks cover only a small set of health time series modalities and tasks, failing to reflect the diverse domains and extensive temporal dependencies inherent in real-world physiological modeling. To bridge these gaps, we introduce HEARTS (Health Reasoning over Time Series), a unified benchmark for evaluating hierarchical reasoning capabilities of LLMs over general health time series. HEARTS integrates 16 real-world datasets across 12 health domains and 20 signal modalities, and defines a comprehensive taxonomy of 110 tasks grouped into four core capabilities: Perception, Inference, Generation, and Deduction. Evaluating 14 state-of-the-art LLMs on more than 20K test samples reveals intriguing findings. First, LLMs substantially underperform…
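The abstract groups tasks into four core capabilities, which suggests reporting scores per capability as well as per task. The sketch below is one possible way to do that aggregation, not the authors' evaluation code. Only the four capability names come from the abstract. The `TaskResult` type, the task identifiers, and the scores are hypothetical placeholders, not HEARTS tasks or reported results.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# Capability names are taken from the abstract; everything else is assumed.
CAPABILITIES = ("Perception", "Inference", "Generation", "Deduction")


@dataclass(frozen=True)
class TaskResult:
    task: str        # hypothetical task identifier
    capability: str  # one of CAPABILITIES
    score: float     # placeholder metric in [0, 1], not a reported result


def capability_averages(results):
    """Return the mean task score within each capability (macro-average)."""
    grouped = defaultdict(list)
    for r in results:
        if r.capability not in CAPABILITIES:
            raise ValueError(f"unknown capability: {r.capability}")
        grouped[r.capability].append(r.score)
    return {cap: mean(scores) for cap, scores in grouped.items()}


# Placeholder rows for illustration only.
results = [
    TaskResult("task_a", "Perception", 0.5),
    TaskResult("task_b", "Perception", 1.0),
    TaskResult("task_c", "Deduction", 0.25),
]
averages = capability_averages(results)
print(averages)
```

Averaging over tasks rather than over samples keeps a capability with a few very large tasks from dominating its score. Whether the paper aggregates this way is not stated in the text above.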

Code & Models

Datasets

yang-ai-lab/HEARTS
dataset · 66 dl

Taxonomy

Topics: Machine Learning in Healthcare · Topic Modeling · Explainable Artificial Intelligence (XAI)