MultiFormer: A Multi-Person Pose Estimation System Based on CSI and Attention Mechanism
Yanyi Qu, Haoyang Ma, and Wenhui Xiong

TL;DR
MultiFormer is a wireless sensing system that uses Transformer-based feature extraction and multi-stage fusion to improve the accuracy of multi-person human pose estimation from channel state information (CSI), especially for high-mobility keypoints.
Contribution
The paper introduces MultiFormer, combining a Transformer-based dual-token feature extractor with a multi-stage fusion network for enhanced CSI-based pose estimation.
Findings
Achieves higher accuracy than state-of-the-art methods on the public MM-Fi dataset and a self-collected dataset.
Effectively models inter-subcarrier correlations and temporal dependencies in CSI.
Improves estimation of high-mobility keypoints like wrists and elbows.
Abstract
Human pose estimation based on Channel State Information (CSI) has emerged as a promising approach for non-intrusive and precise human activity monitoring, yet it faces challenges including accurate multi-person pose recognition and effective CSI feature learning. This paper presents MultiFormer, a wireless sensing system that accurately estimates human pose from CSI. The proposed system adopts a Transformer-based time-frequency dual-token feature extractor with multi-head self-attention. This feature extractor models the inter-subcarrier correlations and temporal dependencies of the CSI. The extracted CSI features and the pose probability heatmaps are then fused by a Multi-Stage Feature Fusion Network (MSFN) to enforce anatomical constraints. Extensive experiments conducted on the public MM-Fi dataset and our self-collected dataset show that MultiFormer achieves higher…
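The time-frequency dual-token idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single real-valued CSI matrix of shape (time steps, subcarriers), treats each row as a time token and each column as a frequency token, and uses random projection matrices in place of learned weights. The function names `multi_head_self_attention` and `dual_token_features` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, num_heads, rng):
    """Scaled dot-product self-attention over a (n_tokens, dim) array.

    Random projections stand in for learned weights (an assumption
    made purely for illustration).
    """
    n, d = tokens.shape
    assert d % num_heads == 0, "token dim must be divisible by num_heads"
    d_head = d // num_heads
    w_q, w_k, w_v, w_o = (
        rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4)
    )

    def split(x):
        # (n, d) -> (heads, n, d_head)
        return x.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    weights = softmax(scores, axis=-1)
    out = (weights @ v).transpose(1, 0, 2).reshape(n, d)  # merge heads
    return out @ w_o, weights

def dual_token_features(csi, num_heads=4, seed=0):
    """Attend over time tokens and frequency tokens of one CSI matrix.

    csi: array of shape (T, S) = (time steps, subcarriers).
    """
    rng = np.random.default_rng(seed)
    time_tokens = csi      # (T, S): one token per time step
    freq_tokens = csi.T    # (S, T): one token per subcarrier
    # Attention among time tokens captures temporal dependencies.
    time_out, time_attn = multi_head_self_attention(time_tokens, num_heads, rng)
    # Attention among frequency tokens captures inter-subcarrier correlations.
    freq_out, freq_attn = multi_head_self_attention(freq_tokens, num_heads, rng)
    return time_out, freq_out, time_attn, freq_attn
```

Calling `dual_token_features` on a (16, 32) array returns a (16, 32) time-branch output and a (32, 16) frequency-branch output, along with per-head attention maps for each branch. How the two branches are combined and passed to the MSFN is not specified in this summary, so the sketch leaves that out.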
Taxonomy
Topics: Human Pose and Action Recognition · Gait Recognition and Analysis · Context-Aware Activity Recognition Systems
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing
