HiMAE: Hierarchical Masked Autoencoders Discover Resolution-Specific Structure in Wearable Time Series
Simon A. Lee, Cyrus Tanade, Hao Zhou, Juhyeon Lee, Megha Thukral, Minji Han, Rachel Choi, Md Sazzad Hissain Khan, Baiying Lu, Migyeong Gwak, Mehrab Bin Morshed, Viswam Nathan, Md Mahbubur Rahman, Li Zhu, Subramaniam Venkatraman, Sharanya Arcot Desai

TL;DR
HiMAE introduces a hierarchical masked autoencoder framework that learns multi-resolution embeddings from wearable sensor data, revealing scale-specific structures and enabling efficient, interpretable edge inference for health monitoring.
Contribution
The paper presents HiMAE, a novel hierarchical autoencoder that captures resolution-specific features in wearable time series, outperforming existing models and enabling real-time edge inference.
Findings
HiMAE outperforms state-of-the-art models across multiple benchmarks.
It produces interpretable, multi-resolution embeddings.
It achieves sub-millisecond inference on smartwatch CPUs.
Abstract
Wearable sensors provide abundant physiological time series, yet the principles governing their predictive utility remain unclear. We hypothesize that temporal resolution is a fundamental axis of representation learning, with different clinical and behavioral outcomes relying on structure at distinct scales. To test this resolution hypothesis, we introduce HiMAE (Hierarchical Masked Autoencoder), a self-supervised framework that combines masked autoencoding with a hierarchical convolutional encoder-decoder. HiMAE produces multi-resolution embeddings that enable systematic evaluation of which temporal scales carry predictive signal, transforming resolution from a hyperparameter into a probe for interpretability. Across classification, regression, and generative benchmarks, HiMAE consistently outperforms state-of-the-art foundation models that collapse scale, while being orders of…
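To make the abstract's two ingredients concrete, the following is a minimal sketch of patch masking and a multi-resolution feature pyramid on a toy 1-D signal. It is not the authors' implementation: the function names, the patch length, the mask ratio, and the number of levels are all assumptions chosen for illustration, and stride-2 averaging stands in for the learned strided convolutions a real hierarchical encoder would use.

```python
import random


def mask_patches(signal, patch_len, mask_ratio, seed=0):
    """Zero out a random subset of contiguous patches.

    Returns the masked signal and a per-sample boolean mask
    (True where the sample was hidden). All parameters are illustrative.
    """
    n_patches = len(signal) // patch_len
    n_masked = int(round(mask_ratio * n_patches))
    rng = random.Random(seed)
    masked_ids = set(rng.sample(range(n_patches), n_masked))
    masked = list(signal)
    mask = [False] * len(signal)
    for p in masked_ids:
        for i in range(p * patch_len, (p + 1) * patch_len):
            masked[i] = 0.0
            mask[i] = True
    return masked, mask


def encode_pyramid(signal, levels):
    """Build one feature sequence per temporal resolution.

    Assumption: stride-2 average pooling replaces learned strided
    convolutions, so each level halves the sequence length.
    """
    feats = []
    x = list(signal)
    for _ in range(levels):
        x = [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x) - 1, 2)]
        feats.append(x)
    return feats


def masked_mse(recon, target, mask):
    """Mean squared error computed only over the masked samples."""
    errs = [(r - t) ** 2 for r, t, m in zip(recon, target, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0


# Toy periodic signal: 64 samples, 8 patches of length 8.
signal = [float(i % 8) for i in range(64)]
masked, mask = mask_patches(signal, patch_len=8, mask_ratio=0.5)
pyramid = encode_pyramid(masked, levels=3)
resolutions = [len(f) for f in pyramid]

# Loss of the trivial "reconstruction" that leaves masked samples at zero;
# a trained decoder would aim to drive this toward zero.
loss = masked_mse(masked, signal, mask)
```

In a setup like this, each entry of `pyramid` could be pooled and fed to a separate linear probe, which is one way to ask which temporal scale carries the predictive signal for a given downstream task.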
Peer Reviews
Decision·ICLR 2026 Poster
1. High Practical Significance (On-Watch Inference): The paper's most compelling strength is its focus on and successful demonstration of a "true on-watch" SSL model. By creating a model that is ~99% smaller (1.2M vs 110M params) and significantly faster (0.99ms CPU latency) than transformer baselines, the authors present a practical path toward real-time, continuous health monitoring while preserving user privacy (since data does not need to leave the device).
2. Novel Interpretability Framewo
1. Poor Presentation and Figure Quality: The paper suffers from a clear lack of polish. For instance, the choice of color palette is a significant problem. In Figures 3, 4, 5, and 15, the colors for competing models are nearly indistinguishable (e.g., "PaPaGei-S", "SimCLR", "DINO", "MSN", "LSM", "HiMAE" are all shades of blue/teal/green). This makes it extremely difficult to review the paper's core performance claims. This is particularly problematic for presenting clear comparisons against base
The training set is large, with substantial hours and participants, which is good for alleviating subject-specific noise. The experiment is comprehensive, including different classification tasks, such as cardiovascular and sleep stages, which are the functionalities widely required for wearable devices. The model scale and total training time show the efficiency and applicability for wearable devices.
The method lacks novelty. Multi-scale learning is commonly used in medical/wearable time series representation learning, either through model-based approaches or manual data preprocessing. Multi-scale learning with convolutional networks is even more common in past research on time series analysis. In addition, a comprehensive results table comparing against baseline methods on the different tasks would be helpful. The results in Figure 5 are not straightforward enough.
- The idea that different downstream tasks may depend on representations at distinct temporal resolutions is both intuitive and well-motivated. Designing the architecture explicitly around this principle provides a clear and reasonable inductive bias for modeling physiological time series.
- The proposed method is thoroughly evaluated across several downstream tasks, showing consistent improvements over large-scale baselines such as LSM while using substantially fewer parameters.
- The emphasis
- While the idea of resolution as interpretability is interesting, the paper primarily focuses on downstream performance differences across embeddings obtained from different HiMAE layers. Although this evaluation setup is reasonable, it is not clear that it fully substantiates the claimed interpretability of HiMAE. Visualization analyses of embeddings or frequency-response characterizations across layers could strengthen the interpretability argument.
- The authors could provide more intuition
