Scaling Wearable Foundation Models
Girish Narayanswamy, Xin Liu, Kumar Ayush, Yuzhe Yang, Xuhai Xu, Shun Liao, Jake Garrison, Shyam Tailor, Jake Sunshine, Yun Liu, Tim Althoff, Shrikanth Narayanan, Pushmeet Kohli, Jiening Zhan, Mark Malhotra, Shwetak Patel, Samy Abdel-Ghaffar, Daniel McDuff

TL;DR
This paper explores the scaling properties of a large multimodal wearable sensor foundation model, LSM, demonstrating its effectiveness in various predictive tasks and its potential for efficient downstream learning.
Contribution
It introduces LSM, the largest wearable sensor foundation model to date, and investigates its scaling laws across compute, data, and model size.
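As a rough illustration of what fitting a scaling law involves, the sketch below fits a power law of the form y = a * x^(-b) by least squares in log-log space. The functional form is an assumption borrowed from the general scaling-law literature, not a statement of the paper's exact fitting procedure, and the function name `fit_power_law` and the data points are hypothetical, synthetic values rather than results from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space.

    Returns (a, b). Assumes every x and y is strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Ordinary least squares on (log x, log y): slope = Sxy / Sxx.
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log y = log a - b * log x, so a = exp(intercept) and b = -slope.
    return math.exp(intercept), -slope


# Synthetic, illustrative points only (NOT results from the paper):
# a loss that decays as 2.0 * hours**(-0.1).
hours = [1e3, 1e4, 1e5, 1e6]
loss = [2.0 * h ** -0.1 for h in hours]
a, b = fit_power_law(hours, loss)
```

The same routine could be applied with model size or compute on the x-axis; a straight line in log-log space is what makes the relationship a power law.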
Findings
LSM improves performance on imputation, interpolation, and extrapolation tasks.
LSM enables sample-efficient downstream activity recognition.
Scaling laws for wearable sensor models are established.
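The three generative tasks named above differ mainly in which part of the input window is hidden from the model. The sketch below builds a per-time-step mask for each task under commonly used definitions, which are assumptions here: imputation hides randomly scattered steps, interpolation hides a contiguous span in the middle, and extrapolation hides the trailing span. The function `make_mask` and its parameters are hypothetical and are not taken from the paper.

```python
import random


def make_mask(n_steps, task, n_masked, seed=0):
    """Return a list of booleans, True where a time step is hidden.

    Task definitions are illustrative assumptions:
      imputation    -> n_masked randomly chosen steps
      interpolation -> a contiguous block centred in the window
      extrapolation -> the final n_masked steps
    """
    if not 0 < n_masked < n_steps:
        raise ValueError("n_masked must be between 1 and n_steps - 1")
    if task == "imputation":
        rng = random.Random(seed)  # seeded for reproducibility
        hidden = rng.sample(range(n_steps), n_masked)
    elif task == "interpolation":
        start = (n_steps - n_masked) // 2
        hidden = range(start, start + n_masked)
    elif task == "extrapolation":
        hidden = range(n_steps - n_masked, n_steps)
    else:
        raise ValueError(f"unknown task: {task}")
    mask = [False] * n_steps
    for i in hidden:
        mask[i] = True
    return mask


# Example on a hypothetical 10-step window.
interp = make_mask(10, "interpolation", 4)
extrap = make_mask(10, "extrapolation", 3)
imput = make_mask(10, "imputation", 4)
```

A model would then be asked to reconstruct the values at the masked positions from the visible ones, and the reconstruction error on those positions gives the task score.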
Abstract
Wearable sensors have become ubiquitous thanks to a variety of health tracking features. The resulting continuous and longitudinal measurements from everyday life generate large volumes of data; however, making sense of these observations for scientific and actionable insights is non-trivial. Inspired by the empirical success of generative modeling, where large neural networks learn powerful representations from vast amounts of text, image, video, or audio data, we investigate the scaling properties of sensor foundation models across compute, data, and model size. Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, electrodermal activity, accelerometer, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM, a multimodal foundation model built on the largest wearable-signals dataset with the most extensive range of…
Peer Reviews
Decision: ICLR 2025 Poster
Strengths: 1. The experiments are conducted at large scale and model a variety of signals, including cardiac, skin, and motion activity. 2. This work empirically shows that scaling laws also apply to the modality of wearable signals when scaling up compute, dataset size, and model size. 3. The work includes a reasonable baseline comparison with vision-based models.
Weaknesses: 1. The model has a fixed-shape input, which raises concerns about generalization. For example, it is very common for different devices and scenarios to have different sets of sensors with different sampling rates. Some wearables have PPG + inertial sensor data only. Some have other more sophisticated sensors like Galvanic Skin Response and Electrocardiogram. It’s unclear exactly how such a model would work with different sets of input modalities with diverse sampling and data resolution se
Strengths: 1. Several tasks have been defined for multimodal model evaluation: imputation, interpolation 2. A robust set of activities and an extensive training dataset covering 165k people. 3. Comparative analysis across model parameters and data size through an ablation study. 4. Analysis of multiple tasks: classification, interpolation, imputation, and extrapolation.
Weaknesses: 1. Only 16 features are extracted from PPG, acceleration, skin temperature and conductance, and altimetry, which seems a very reduced space for learning. 2. Temporal interpolation at the scale of 1 minute is not a very accurate task for wearable human data, as it only holds under strong assumptions about human behaviour, e.g. continuous activity, unchanged environment, and no external inputs, among others. 3. The number of individuals used for training and the methods for data labelling need to be clarified.
- The data processing and model training steps are clearly outlined, and the study systematically scales the model across key factors such as data volume, computational resources, and model size.
- The paper includes a variety of tasks that effectively demonstrate the model’s utility across different contexts, enhancing the applicability of the LSM model in both generative and classification domains.
**Areas for Improvement** - Background and Related Work: The paper is thin in covering prior works in multimodal wearable models. The chosen model (ViT) and baselines focus on non-wearable modalities, overlooking well-established approaches tailored to multimodal sensor data such as (but not limited to): > [1] Saeed, A., Ungureanu, V. and Gfeller, B., 2021. Sense and learn: Self-supervision for omnipresent sensors. Machine Learning with Applications, 6, p.100152. > [2] Deldari, S., Spathis, D
Taxonomy
Topics: Architecture and Computational Design · BIM and Construction Integration
