Benchmarking LLM Summaries of Multimodal Clinical Time Series for Remote Monitoring

Aditya Shukla; Yining Yuan; Ben Tamo; Yifei Wang; Micky Nnamdi; Shaun Tan; Jieru Li; Benoit Marteau; Brad Willingham; May Wang

arXiv:2603.01557 · cs.AI · March 3, 2026

TL;DR

This paper introduces an event-based evaluation framework for assessing the clinical accuracy of LLM-generated summaries of multimodal time-series data, revealing that traditional metrics often overlook critical event fidelity.

Contribution

The paper presents a novel event-aware evaluation protocol and benchmarks several summarization approaches, highlighting the importance of event-level correctness in clinical summaries.

Findings

01 Vision-based approach achieves 45.7% abnormality recall.

02 High semantic similarity does not guarantee event accuracy.

03 Event-aware metrics reveal gaps in existing summarization methods.

Abstract

Large language models (LLMs) can generate fluent clinical summaries of remote therapeutic monitoring time series. However, it remains unclear whether these narratives faithfully capture clinically significant events, such as sustained abnormalities. Existing evaluation metrics primarily focus on semantic similarity and linguistic quality, leaving event-level correctness largely unmeasured. To address this gap, we introduce an event-based evaluation framework for multimodal time-series summarization using the Technology-Integrated Health Management (TIHM)-1.5 dementia monitoring dataset. Clinically grounded daily events are derived through rule-based abnormal thresholds and temporal persistence criteria. Model-generated summaries are then aligned with these structured facts. Our evaluation protocol measures abnormality recall, duration recall, measurement coverage, and hallucinated…
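The event derivation and abnormality recall described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not taken from the paper: the normal ranges in `NORMAL_RANGES`, the two-day persistence window `min_days`, the `Event` record, and the simplification that a summary has already been reduced to the set of measurements it reports as abnormal. The paper's actual thresholds, persistence criteria, and alignment procedure may differ.

```python
from dataclasses import dataclass

# Hypothetical normal ranges for illustration only; NOT the paper's thresholds.
NORMAL_RANGES = {
    "heart_rate": (50.0, 100.0),        # beats per minute
    "body_temperature": (36.0, 37.5),   # degrees Celsius
}


@dataclass(frozen=True)
class Event:
    """A sustained abnormality: a run of consecutive abnormal days."""
    measurement: str
    start_day: int
    duration_days: int


def derive_events(daily_values, normal_ranges=NORMAL_RANGES, min_days=2):
    """Return runs of at least `min_days` consecutive out-of-range days.

    `daily_values` maps a measurement name to one value per day; a missing
    day is given as None and is treated as ending any ongoing run.
    """
    events = []
    for name, values in daily_values.items():
        low, high = normal_ranges[name]
        run_start = None
        # The trailing None sentinel closes a run that reaches the last day.
        for day, value in enumerate(list(values) + [None]):
            abnormal = value is not None and not (low <= value <= high)
            if abnormal and run_start is None:
                run_start = day
            elif not abnormal and run_start is not None:
                length = day - run_start
                if length >= min_days:
                    events.append(Event(name, run_start, length))
                run_start = None
    return events


def abnormality_recall(reference_events, mentioned_measurements):
    """Fraction of reference events whose measurement the summary flags."""
    if not reference_events:
        return None  # recall is undefined when there are no reference events
    hits = sum(e.measurement in mentioned_measurements for e in reference_events)
    return hits / len(reference_events)


# Toy data: six days of readings for two measurements.
daily = {
    "heart_rate": [72, 110, 112, 115, 80, 120],
    "body_temperature": [36.6, 38.0, 38.2, 36.7, 36.5, 36.4],
}
events = derive_events(daily)
recall = abnormality_recall(events, {"heart_rate"})
```

In this toy input, the single high heart-rate reading on the last day does not become an event because it fails the persistence check, and a summary that mentions only the heart-rate abnormality recovers one of the two reference events. A summary could therefore read as fluent and on-topic while still missing half of the structured facts, which is the kind of gap an event-level metric is meant to expose.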

Taxonomy

Topics: Machine Learning in Healthcare · Time Series Analysis and Forecasting · Electronic Health Records Systems