ESL-Bench: An Event-Driven Synthetic Longitudinal Benchmark for Health Agents

Chao Li; Cailiang Liu; Ang Gao; Kexin Deng; Shu Zhang; Langping Xu; Xiaotong Shi; Xionghao Ding; Jian Pei; Xun Jiang

arXiv:2604.02834·cs.AI·April 6, 2026

TL;DR

ESL-Bench is a synthetic, event-driven benchmark designed to evaluate health agents over multi-year trajectories, enabling systematic assessment of reasoning across diverse health data sources.

Contribution

It introduces a comprehensive synthetic benchmark with structured ground truth for evaluating health agents' reasoning capabilities across multiple query types.

Findings

01. DB agents outperform memory RAG baselines by 10-20% on key query types.

02. Evaluation across five question dimensions shows varied method strengths and weaknesses.

03. Synthetic data allows for controlled, scalable assessment of health reasoning algorithms.

Abstract

Longitudinal health agents must reason across multi-source trajectories that combine continuous device streams, sparse clinical exams, and episodic life events - yet evaluating them is hard: real-world data cannot be released at scale, and temporally grounded attribution questions seldom admit definitive answers without structured ground truth. We present ESL-Bench, an event-driven synthesis framework and benchmark providing 100 synthetic users, each with a 1-5 year trajectory comprising a health profile, a multi-phase narrative plan, daily device measurements, periodic exam records, and an event log with explicit per-indicator impact parameters. Each indicator follows a baseline stochastic process driven by discrete events with sigmoid-onset, exponential-decay kernels under saturation and projection constraints; a hybrid pipeline delegates sparse semantic artifacts to LLM-based…
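The event-driven indicator model described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names `event_kernel` and `simulate_indicator`, the kernel shape (a logistic rise multiplied by an exponential decay), the use of `tanh` for saturation, clipping to a fixed range as the projection step, and Gaussian noise around a constant as the baseline process.

```python
import math
import random


def event_kernel(t, onset_days=3.0, decay_days=14.0):
    """Hypothetical impact kernel: sigmoid onset times exponential decay.

    t is the number of days since the event started; the kernel is zero
    before the event. Parameter names and defaults are illustrative only.
    """
    if t < 0:
        return 0.0
    rise = 1.0 / (1.0 + math.exp(-(t - onset_days)))  # logistic onset
    decay = math.exp(-t / decay_days)                  # exponential decay
    return rise * decay


def simulate_indicator(days, baseline, events, cap, lo, hi,
                       noise_sd=0.0, seed=0):
    """Simulate one daily indicator under assumed dynamics.

    events is a list of (start_day, amplitude) pairs. The summed event
    effect is squashed with tanh so it never exceeds +/- cap (saturation),
    and each daily value is clipped to [lo, hi] (projection).
    """
    rng = random.Random(seed)
    series = []
    for day in range(days):
        raw = sum(amp * event_kernel(day - start) for start, amp in events)
        effect = cap * math.tanh(raw / cap)            # saturation
        value = baseline + effect + rng.gauss(0.0, noise_sd)
        series.append(min(hi, max(lo, value)))         # projection
    return series


# Example: a resting-heart-rate-like indicator with one event on day 5.
trajectory = simulate_indicator(
    days=60, baseline=60.0, events=[(5, 8.0)],
    cap=10.0, lo=40.0, hi=100.0, noise_sd=1.0, seed=42,
)
```

Because the event list and its amplitudes are stored explicitly, a question such as "which event raised this indicator on a given day?" has a computable answer, which is the kind of structured ground truth the abstract refers to.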

Code & Models

Datasets

healthmemoryarena/ESL-Bench
dataset · 2.5k dl