MAEPose: Self-Supervised Spatiotemporal Learning for Human Pose Estimation on mmWave Video
Xijia Wei, Yuan Fang, Kevin Chetty, Youngjun Cho, and Nadia Bianchi-Berthouze

TL;DR
MAEPose introduces a self-supervised, spatiotemporal learning method for human pose estimation directly from mmWave radar videos, outperforming existing approaches and reducing system complexity.
Contribution
The paper presents MAEPose, a masked autoencoding framework that learns from unlabelled radar videos and improves pose estimation accuracy without relying on intermediate representations such as sparse point clouds or spectrogram images.
Findings
Outperforms state-of-the-art methods by up to 22.1% in mean per-joint position error (MPJPE)
Remains robust under zero-shot bystander interference, with only a 6.5% increase in error
Leverages Range-Doppler videos for better performance and lower computational cost
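For readers unfamiliar with the metric behind these figures, the following is a minimal sketch of how MPJPE and a relative change in error are commonly computed. It assumes predictions and ground truth are arrays of joint coordinates in the same units. The function names `mpjpe` and `relative_change` are illustrative, and the paper's exact evaluation protocol (2D or 3D joints, any alignment step, units) is not stated in this summary.

```python
import numpy as np

def mpjpe(pred, target):
    """Mean per-joint position error.

    pred, target: arrays of shape (..., num_joints, dims) in the same units.
    Returns the Euclidean distance per joint, averaged over all joints and frames.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    return float(np.linalg.norm(pred - target, axis=-1).mean())

def relative_change(new, baseline):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline
```

Read this way (an assumption about how the percentages are defined), a 22.1% improvement corresponds to `relative_change(ours, baseline)` being about -22.1, and the bystander result corresponds to `relative_change(with_bystander, without_bystander)` being about +6.5.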
Abstract
Millimetre-wave (mmWave) radar offers a more privacy-preserving alternative to RGB-based human pose estimation. However, existing methods typically rely on pre-extracted intermediate representations such as sparse point clouds or spectrogram images. This discards the rich spatiotemporal information naturally present in radar video streams, and the extra signal processing adds system complexity. In addition, existing solutions are mainly trained end-to-end in a supervised manner, without leveraging unlabelled raw video streams to learn generalized representations. In this study, we present MAEPose, a masked autoencoding-based human pose estimation approach that operates directly on mmWave spectrogram videos. MAEPose learns spatiotemporal motion-aware generalized representations from unlabelled radar video, and leverages its heatmap decoder for multi-frame pose…
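The masked autoencoding idea mentioned in the abstract can be illustrated with a generic sketch: split a video into spatiotemporal patches, hide a random subset, and score a reconstruction only on the hidden patches. This is not the authors' implementation. The patch size, the 75% mask ratio, the mean-squared-error loss, and all function names below are assumptions chosen for illustration, since the summary does not give MAEPose's masking strategy or architecture.

```python
import numpy as np

def patchify(video, pt, ph, pw):
    """Split a (T, H, W) video into non-overlapping (pt, ph, pw) patches.

    Returns an array of shape (num_patches, pt * ph * pw).
    """
    T, H, W = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # patch-grid axes first, then within-patch axes
    return x.reshape(-1, pt * ph * pw)

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean array that is True for masked (hidden) patches."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def masked_reconstruction_loss(pred_patches, true_patches, mask):
    """Mean squared error computed on the masked patches only."""
    diff = pred_patches[mask] - true_patches[mask]
    return float((diff ** 2).mean())

# Hypothetical usage on a random stand-in for a radar video clip.
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 32, 32))       # (frames, height, width), assumed layout
patches = patchify(video, pt=2, ph=8, pw=8)    # 4 * 4 * 4 = 64 patches of 128 values
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)
visible = patches[~mask]                       # only these would be fed to an encoder
```

In a full pipeline an encoder would see only `visible`, a decoder would predict the hidden patches, and `masked_reconstruction_loss` would drive pre-training on unlabelled video. The pose heatmap decoder described in the abstract would be a separate stage and is not sketched here.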
