SensorLLM: Aligning Large Language Models with Motion Sensors for Human Activity Recognition
Zechen Li, Shohreh Deldari, Linyao Chen, Hao Xue, Flora D. Salim

TL;DR
SensorLLM introduces a novel two-stage framework that enables large language models to effectively perform human activity recognition from sensor time-series data by aligning sensor inputs with semantic descriptions and fine-tuning for classification.
Contribution
The paper presents a new method for aligning sensor data with language descriptions and fine-tuning LLMs for HAR, overcoming previous limitations in processing numerical sensor data.
Findings
SensorLLM achieves state-of-the-art HAR performance.
The approach generalizes across multiple datasets.
It effectively captures sensor data semantics without human annotations.
Abstract
We introduce SensorLLM, a two-stage framework that enables Large Language Models (LLMs) to perform human activity recognition (HAR) from sensor time-series data. Despite their strong reasoning and generalization capabilities, LLMs remain underutilized for motion sensor data due to the lack of semantic context in time-series, computational constraints, and challenges in processing numerical inputs. SensorLLM addresses these limitations through a Sensor-Language Alignment stage, where the model aligns sensor inputs with trend descriptions. Special tokens are introduced to mark channel boundaries. This alignment enables LLMs to capture numerical variations, channel-specific features, and data of varying durations, without requiring human annotations. In the subsequent Task-Aware Tuning stage, we refine the model for HAR classification, achieving performance that matches or surpasses…
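The abstract describes aligning sensor inputs with automatically generated trend descriptions and marking channel boundaries with special tokens. The following is a minimal Python sketch of that idea, not the authors' implementation. The function names (`describe_trends`, `trends_to_text`, `wrap_channel`), the three-way rising/falling/steady labeling, the `tol` threshold, and the `<acc_x_start>`-style token format are all assumptions made for illustration.

```python
from typing import List, Tuple


def describe_trends(values: List[float], tol: float = 1e-6) -> List[Tuple[int, int, str]]:
    """Split a 1-D series into maximal runs of rising, falling, or steady steps.

    Returns (start_index, end_index, label) tuples. The labeling rule is a
    hypothetical stand-in for whatever trend annotation the paper uses.
    """
    segments: List[Tuple[int, int, str]] = []
    for i in range(1, len(values)):
        delta = values[i] - values[i - 1]
        label = "rising" if delta > tol else "falling" if delta < -tol else "steady"
        if segments and segments[-1][2] == label:
            # Same trend as the previous step: extend the current segment.
            start, _, _ = segments[-1]
            segments[-1] = (start, i, label)
        else:
            # Trend changed: open a new segment starting at the previous sample.
            segments.append((i - 1, i, label))
    return segments


def trends_to_text(channel: str, values: List[float]) -> str:
    """Render the trend segments as a sentence, with no human annotation involved."""
    parts = [f"{label} from step {s} to {e}" for s, e, label in describe_trends(values)]
    return f"{channel}: " + "; ".join(parts) + "."


def wrap_channel(channel: str, values: List[float]) -> str:
    """Surround one channel's readings with hypothetical boundary tokens."""
    body = " ".join(f"{v:.2f}" for v in values)
    return f"<{channel}_start> {body} <{channel}_end>"


if __name__ == "__main__":
    acc_x = [0.0, 1.0, 2.0, 2.0, 1.0]
    print(wrap_channel("acc_x", acc_x))
    print(trends_to_text("acc_x", acc_x))
```

Pairs of wrapped sensor input and generated description like these could serve as alignment training examples. Because the descriptions are computed from the signal itself, no manual labeling is needed at this stage.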
Peer Reviews
Decision: Submitted to ICLR 2025
1. Open-source code is already available.
2. Comprehensive evaluation on 4 datasets, evaluating meaningful language representations from the test sets.
3. The appendix is well developed, explaining the SOTA, the details of the datasets, and many examples of results produced by their model.
4. A very well-implemented set of metrics for evaluating sensor data understanding.
1. The architecture description of their task-aware model was not comprehensive. A diagram of each architecture would help.
2. The innovative points of the architecture could be better explained. Although they are sometimes explained, it is not clear what the contribution of their own architecture is.
3. A slight lack of explanation of the process followed by the 5 human experts: it would be nice to have a short description of how the data was presented and evaluated, as well as who these experts are (level
The paper's strengths are as follows:
- The paper is well written and the approach is described in detail. It was relatively easy to follow what the authors are proposing and how they implemented it.
- The authors do a good job of comparing their approach with a variety of state-of-the-art HAR models in the literature. The comparison plots are clear.
- I think the problem that the authors are trying to solve (enabling text-only LLMs to understand time-series sensor data) is an important and relevant prob
There are some major weaknesses in the authors' approach, as listed below:
* It is not at all clear what value the LLM (Llama-3 8b) is adding in the authors' approach: the authors essentially use a frozen TS-embedding model (Chronos) + fine-tuned MLP + frozen LLM (Llama-3) as a feature extractor, followed by a fine-tuned MLP as an HAR classifier. I think it is significant overkill to use a versatile model like Llama-3-8b within a fixed multi-class HAR classifier. There is no exploration of zero-shot clas
The work demonstrates limited originality in attempting to fuse time series data with language model inference. It readily draws upon existing time series encoding paradigms and LLMs, with limited training novelty via the MLPs. It addresses a practical problem by focusing on human activity recognition, which has a broad spectrum of potential applications. The treatment is somewhat rigorous in terms of ablations and comparisons to prior art. There do remain considerable areas of concern and g
The article cites computational complexity as a reason for pursuing this approach over classic time-series methods. However, it does not seem to acknowledge that the LLM and the TS encoder have already required significant computational resources. Further, the ability of LLMs to comprehend trends derived from TS embeddings in a quantifiable sense remains a key open question. The training data representation and its quality determine success at this task, and there is no guarantee established thus far
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Human Pose and Action Recognition
Methods: ALIGN
