PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing
Yiping Xie, Bo Zhao, Mingtong Dai, Jian-Ping Zhou, Yue Sun, Tao Tan, Weicheng Xie, Linlin Shen, Zitong Yu

TL;DR
PhysLLM is a framework that combines large language models (LLMs) with domain-specific rPPG components to make remote physiological sensing more accurate and more robust to illumination changes and motion artifacts.
Contribution
The paper proposes PhysLLM, a collaborative optimization framework that aligns physiological signals with language models through Text Prototype Guidance (TPG) and introduces a Dual-Domain Stationary (DDS) algorithm for signal stability.
Findings
Achieves state-of-the-art intra-dataset heart-rate (HR) estimation accuracy on benchmark datasets, with gains on BUAA and MMPD.
Demonstrates robustness to illumination changes and motion artifacts.
Effectively integrates visual and textual information for physiological measurement.
Abstract
Remote photoplethysmography (rPPG) enables non-contact physiological measurement but remains highly susceptible to illumination changes, motion artifacts, and limited temporal modeling. Large Language Models (LLMs) excel at capturing long-range dependencies, offering a potential solution, but they struggle with the continuous, noise-sensitive nature of rPPG signals due to their text-centric design. To bridge this gap, we introduce PhysLLM, a collaborative optimization framework that synergizes LLMs with domain-specific rPPG components. Specifically, the Text Prototype Guidance (TPG) strategy is proposed to establish cross-modal alignment by projecting hemodynamic features into an LLM-interpretable semantic space, effectively bridging the representational gap between physiological signals and linguistic tokens. In addition, a novel Dual-Domain Stationary (DDS) Algorithm is proposed for resolving…
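The abstract describes TPG only at a high level, so the following is a minimal sketch of one generic way to project signal features into a text embedding space, not the paper's implementation. It assumes the rPPG features have already been mapped to the LLM embedding width, and that a small set of "text prototype" vectors is available. Each feature is re-expressed as an attention-weighted mix of those prototypes. The names `align_to_prototypes`, `features`, and `prototypes` are hypothetical, and the learned query/key/value projections a real model would use are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def align_to_prototypes(features, prototypes):
    """Re-express each feature as an attention-weighted mix of prototypes.

    features:   T vectors of width d (hypothetical rPPG features).
    prototypes: K vectors of width d (hypothetical text prototypes).
    Returns (aligned, weights): T vectors of width d, and T rows of K weights.
    """
    d = len(prototypes[0])
    scale = 1.0 / math.sqrt(d)  # scaled dot-product attention
    aligned, weights = [], []
    for feature in features:
        row = softmax([dot(feature, p) * scale for p in prototypes])
        mixed = [
            sum(row[k] * prototypes[k][j] for k in range(len(prototypes)))
            for j in range(d)
        ]
        aligned.append(mixed)
        weights.append(row)
    return aligned, weights


# Toy data standing in for real embeddings (illustrative assumption only).
random.seed(0)
d, T, K = 8, 5, 4
prototypes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
features = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
aligned, weights = align_to_prototypes(features, prototypes)
```

Because every output row is a convex combination of prototype vectors, the aligned features stay within the span of vectors the language model already handles, which is the intuition behind prototype-based alignment in general.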
Peer Reviews
Decision: ICLR 2026 Poster
1. Strong empirical results. SOTA intra-dataset HR with gains on BUAA and MMPD.
2. Ablations and component evidence. Removing DDS/VA/TPG degrades metrics; including all yields the best performance, supporting the importance of each piece.
3. Task priors + LLM prompting. The cue design (task, visual, static/statistical) and adaptive fusion are thoughtful ways to inject physiological context into an LLM.

1. Justification for LLM vs. sequence models. It remains unclear whether the LLM is essential beyond acting as a powerful sequence model; comparisons to strong non-LLM long-context baselines (e.g., state-of-the-art time-series Transformers without language pretraining) are missing.
2. Compute/latency & deployability. The paper does not quantify training/inference cost (LLM size, parameter-efficient tuning specifics, throughput on typical devices) or real-time feasibility, which is critical for

1. The idea of introducing LLMs into rPPG estimation is timely.
2. The framework is clearly described, and the modular design (DDS, TPG, APL) is logically motivated.
3. Writing quality and figures are generally good, helping the reader follow the methodology.

1. Limited benchmark coverage: the evaluation misses some widely recognized and more challenging datasets (e.g., V4V, VIPL-HR), which are important for validating cross-domain robustness in practical settings.
2. The choice of LLM backbone (DeepSeek-1.5B) and its integration details are only briefly discussed; it is unclear whether improvements mainly come from the LLM or other architectural refinements.
3. Some modules (e.g., APL and TPG) would benefit from clearer ablation or visualization to

1. Novel cross-modal architecture: The paper introduces PhysLLM, a framework that integrates text, vision, and physiological signals with ideas like Text Prototype Guidance and adaptive cue prompting, representing a meaningful conceptual advance for rPPG.
2. Strong performance across datasets: It shows consistent improvements in both intra-dataset and cross-dataset generalization.

1. Lack of comparison on model size: Integrating LLMs significantly inflates parameter count and computational cost compared to prior rPPG methods. Since remote physiological sensing has strong real-time and mobile deployment requirements, storage and latency overhead are critical. The paper should provide model size, trainable parameter count, and inference efficiency comparisons with existing methods.
2. Effectiveness of semantic info: The LLM backbone is positioned as the central source of i
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Non-Invasive Vital Sign Monitoring · ECG Monitoring and Analysis
