TL;DR
VIMCAN is a hybrid deep learning architecture that efficiently combines sequence modeling and spatial reasoning for real-time multimodal 3D human pose estimation using visual and inertial data.
Contribution
The paper introduces VIMCAN, a hybrid model that pairs Mamba's efficient temporal sequence processing with Cross-Attention's spatial reasoning to fuse visual and inertial data for 3D human pose estimation.
Findings
Achieves a mean per-joint position error (MPJPE) of 17.2 mm on the TotalCapture dataset.
Supports real-time inference at over 60 fps on consumer hardware.
Outperforms prior Transformer-based methods in multimodal 3D human pose estimation.
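For readers unfamiliar with the metric, MPJPE is the Euclidean distance between each predicted joint and its ground-truth position, averaged over all joints and frames. The sketch below shows that computation in plain Python. The function name `mpjpe` and the sample coordinates are illustrative assumptions and do not come from the paper, and evaluation protocols often add alignment steps (such as root-joint centering) that are omitted here.

```python
import math

def mpjpe(pred, target):
    """Mean per-joint position error.

    pred and target are lists of frames; each frame is a list of
    (x, y, z) joint positions in the same unit (e.g. millimetres).
    """
    total = 0.0
    count = 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += math.dist(p, t)  # Euclidean distance for one joint
            count += 1
    return total / count

# One frame with two joints; made-up values, not data from the paper.
pred = [[(0.0, 0.0, 0.0), (100.0, 0.0, 0.0)]]
target = [[(3.0, 4.0, 0.0), (100.0, 0.0, 12.0)]]
error_mm = mpjpe(pred, target)  # per-joint distances are 5 and 12
```

Because the error is an average of distances, a lower value means the predicted skeleton sits closer to the ground truth on average.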
Abstract
The rapid advances in deep learning have significantly enhanced the accuracy of multimodal 3D human pose estimation (HPE). However, the state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity makes real-time processing for long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention, and performs robust visual-inertial fusion and human pose estimation between RGB keypoints and wearable IMU data. By leveraging Mamba's dynamic parameterization for temporal modeling and Attention for…
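To illustrate the kind of cross-modal mixing the abstract refers to, here is a minimal sketch of scaled dot-product cross-attention in plain Python, where one set of tokens queries another. It is not the VIMCAN implementation: using visual keypoint tokens as queries and IMU tokens as keys and values is an assumption, the names `cross_attention`, `visual_tokens`, and `imu_tokens` are hypothetical, and the learned projection matrices of a real attention layer are left out.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    queries: list of vectors of dimension d
    keys:    list of vectors of dimension d
    values:  list of vectors (one per key)
    Returns one output vector per query.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# Hypothetical toy features: 2 visual tokens attend over 3 IMU tokens.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
imu_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(visual_tokens, imu_tokens, imu_tokens)
```

Each output row is a convex combination of the IMU value vectors, weighted by how strongly the corresponding visual token matches each IMU token. The cost grows with the product of the two token counts, which is the quadratic behaviour the abstract contrasts with Mamba's sequence processing.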
