PulseLM: A Foundation Dataset and Benchmark for PPG-Text Learning

Hung Manh Pham, Jinyang Wu, Xiao Ma, Yiming Zhang, Yixin Xu, Aaqib Saeed, Bin Zhu, Zhou Pan, and Dong Ma

arXiv:2603.03331·cs.CL·May 8, 2026

TL;DR

PulseLM is a large-scale PPG-text dataset that pairs PPG waveforms with natural-language questions and answers, supporting language-grounded physiological inference and model benchmarking.

Contribution

The paper introduces PulseLM, a PPG-text question-answering dataset with over 1 million standardized 10-second PPG segments, along with standardized protocols and baseline benchmarks for multimodal PPG analysis.

Findings

01

Established baseline benchmarks with multimodal PPG-aware large language models.

02

Aggregated and harmonized data from 16 sources into 12 downstream tasks.

03

Publicly released dataset and code for community use.

Abstract

Photoplethysmography (PPG) is a widely used non-invasive sensing modality for continuous cardiovascular and physiological monitoring across clinical, laboratory, and wearable settings. While existing PPG datasets support a broad range of downstream tasks, they typically provide supervision in the form of numerical measurements or task-specific labels, limiting their compatibility with language-based interfaces and multimodal foundation models. In this work, we introduce PulseLM, a large-scale PPG-text question-answering dataset that bridges raw PPG waveforms and natural language through a unified question-answering (QA) formulation. PulseLM aggregates PPG recordings from sixteen publicly available sources and harmonizes heterogeneous annotations into 12 downstream tasks. The dataset comprises over 1 million standardized 10-second PPG segments, associated with nearly 2.5 million…
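The segment-plus-QA formulation described in the abstract can be pictured with a minimal sketch. Only the 10-second segment length comes from the abstract. The record fields, the 100 Hz sampling rate, the non-overlapping windowing, and the placeholder question and answer strings are all assumptions made for illustration, not PulseLM's actual schema or preprocessing.

```python
from dataclasses import dataclass
from typing import List

SEGMENT_SECONDS = 10  # segment length stated in the abstract


@dataclass
class PPGQARecord:
    """Hypothetical record layout; field names are illustrative, not PulseLM's schema."""
    signal: List[float]
    question: str
    answer: str


def segment_ppg(samples: List[float], sample_rate_hz: int) -> List[List[float]]:
    """Split a recording into non-overlapping 10-second windows.

    Non-overlapping windows are an assumption; any trailing partial
    window is dropped.
    """
    window = SEGMENT_SECONDS * sample_rate_hz
    return [samples[i:i + window]
            for i in range(0, len(samples) - window + 1, window)]


# Synthetic 25-second recording at an assumed 100 Hz sampling rate.
recording = [0.0] * 2500
segments = segment_ppg(recording, sample_rate_hz=100)

# Pair each segment with a placeholder question and answer.
records = [
    PPGQARecord(
        signal=seg,
        question="Placeholder question about this PPG segment",
        answer="placeholder answer",
    )
    for seg in segments
]
```

With these assumed settings, the 25-second recording yields two full segments and the final 5 seconds are discarded. A real pipeline would take the sampling rate and the question templates from the dataset's own documentation.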
