SensorLM: Learning the Language of Wearable Sensors

Yuwei Zhang; Kumar Ayush; Siyuan Qiao; A. Ali Heydari; Girish Narayanswamy; Maxwell A. Xu; Ahmed A. Metwally; Shawn Xu; Jake Garrison; Xuhai Xu; Tim Althoff; Yun Liu; Pushmeet Kohli; Jiening Zhan; Mark Malhotra; Shwetak Patel; Cecilia Mascolo; Xin Liu; Daniel McDuff; Yuzhe Yang

arXiv:2506.09108·cs.LG·June 12, 2025

TL;DR

SensorLM is a family of foundation models for wearable sensor data that uses natural language to support sensor understanding, captioning, and cross-modal retrieval, with applications in human activity analysis and healthcare.

Contribution

The paper curates the largest sensor-language dataset to date and extends multimodal pretraining architectures such as CLIP and CoCa to sensor data, enabling zero-shot, few-shot, and cross-modal tasks.

Findings

01 Outperforms state-of-the-art methods in zero-shot and few-shot recognition

02 Demonstrates effective sensor captioning and zero-shot generalization

03 Scales well with data and improves label efficiency

Abstract

We present SensorLM, a family of sensor-language foundation models that enable wearable sensor data understanding with natural language. Despite its pervasive nature, aligning and interpreting sensor data with language remains challenging due to the lack of paired, richly annotated sensor-text descriptions in uncurated, real-world wearable data. We introduce a hierarchical caption generation pipeline designed to capture statistical, structural, and semantic information from sensor data. This approach enabled the curation of the largest sensor-language dataset to date, comprising over 59.7 million hours of data from more than 103,000 people. Furthermore, SensorLM extends prominent multimodal pretraining architectures (e.g., CLIP, CoCa) and recovers them as specific variants within a generic architecture. Extensive experiments on real-world tasks in human activity analysis and healthcare…
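
To make the idea of a generic objective that recovers CLIP and CoCa as variants more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the weights `w_con` and `w_cap`, and the temperature value are all hypothetical. It relies only on the commonly described forms of the two objectives: a CLIP-style symmetric contrastive loss over paired embeddings, and a CoCa-style sum of that contrastive loss with a captioning (next-token cross-entropy) loss.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each vector to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis=-1):
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(sensor_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch: row i of each matrix is a matched pair.
    # The temperature of 0.07 is an assumed value for illustration.
    s = l2_normalize(sensor_emb)
    t = l2_normalize(text_emb)
    logits = s @ t.T / temperature
    idx = np.arange(len(s))
    sensor_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_sensor = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (sensor_to_text + text_to_sensor)

def captioning_loss(token_logits, token_ids):
    # Mean cross-entropy of target caption tokens.
    # token_logits: (batch, length, vocab); token_ids: (batch, length), integers.
    logp = log_softmax(token_logits, axis=-1)
    picked = np.take_along_axis(logp, token_ids[..., None], axis=-1)
    return -picked.mean()

def combined_loss(sensor_emb, text_emb, token_logits, token_ids,
                  w_con=1.0, w_cap=1.0):
    # Hypothetical weighting: w_cap=0 leaves a contrastive-only (CLIP-style)
    # objective; both weights nonzero gives a CoCa-style objective.
    return (w_con * contrastive_loss(sensor_emb, text_emb)
            + w_cap * captioning_loss(token_logits, token_ids))
```

In this sketch, setting `w_cap=0` reduces the objective to the contrastive term alone, while keeping both weights positive trains alignment and caption generation jointly. How SensorLM actually parameterizes its generic architecture is not stated in the portion of the abstract shown here.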

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Context-Aware Activity Recognition Systems