VIMCAN: Visual-Inertial 3D Human Pose Estimation with Hybrid Mamba-Cross-Attention Network

Zepeng Yang; Junxuan Bai; Hao Li; Ju Dai; Junjun Pan; Yongfeng Yin; Bin Li

arXiv:2605.07552 · cs.CV · May 13, 2026

TL;DR

VIMCAN is a hybrid deep learning architecture that efficiently combines sequence modeling and spatial reasoning for real-time multimodal 3D human pose estimation using visual and inertial data.

Contribution

The paper introduces VIMCAN, a hybrid model that pairs Mamba's efficient temporal sequence processing with Cross-Attention's spatial reasoning to fuse RGB keypoints with wearable IMU data.
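The cross-attention half of this pairing can be illustrated with a minimal NumPy sketch of single-head scaled dot-product cross-attention, in which one modality supplies the queries and the other supplies the keys and values. This is not the authors' layer: the token counts, feature width, random weights, and the choice of visual tokens as queries are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` attend over `context`."""
    q = queries @ w_q                          # (Nq, d)
    k = context @ w_k                          # (Nc, d)
    v = context @ w_v                          # (Nc, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (Nq, Nc)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

# Hypothetical sizes: 17 keypoint tokens, 6 IMU tokens, 32 features.
rng = np.random.default_rng(0)
num_joints, num_imus, d_model = 17, 6, 32
visual_tokens = rng.normal(size=(num_joints, d_model))
imu_tokens = rng.normal(size=(num_imus, d_model))
w_q, w_k, w_v = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(3)
)

# Assumption: visual tokens query the IMU tokens; the reverse is equally valid.
fused, weights = cross_attention(visual_tokens, imu_tokens, w_q, w_k, w_v)
```

Each row of `weights` shows how strongly one keypoint token draws on each IMU token, which is the kind of cross-modal spatial relationship the contribution statement attributes to Cross-Attention.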

Findings

01 Achieves a mean per-joint position error (MPJPE) of 17.2 mm on the TotalCapture dataset.

02 Supports real-time inference at over 60 fps on consumer hardware.

03 Outperforms prior Transformer-based methods in multimodal 3D human pose estimation.
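For readers unfamiliar with the metric in the first finding, MPJPE is the Euclidean distance between predicted and ground-truth 3D joint positions, averaged over all joints and frames. The sketch below computes it with NumPy; the array layout (frames, joints, 3) is an assumption, and the listing does not say whether the paper applies any alignment step before measuring the error.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the same units as the inputs.

    pred, gt: arrays of shape (..., num_joints, 3).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape or pred.shape[-1] != 3:
        raise ValueError("pred and gt must share a shape ending in 3")
    # Per-joint Euclidean distance, then the mean over joints and frames.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy data: every joint is offset by (3, 4, 0) mm, a distance of 5 mm.
gt_pose = np.zeros((2, 3, 3))
pred_pose = gt_pose + np.array([3.0, 4.0, 0.0])
error_mm = mpjpe(pred_pose, gt_pose)
```

With inputs in millimetres the result is also in millimetres, so `error_mm` here is directly comparable in units to the 17.2 mm figure.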

Abstract

Rapid advances in deep learning have significantly improved the accuracy of multimodal 3D human pose estimation (HPE). However, state-of-the-art (SOTA) HPE pipelines still rely on Transformers, whose quadratic complexity in sequence length makes real-time processing of long sequences impractical. Mamba addresses this issue through selective state-space modeling, enabling efficient sequence processing without sacrificing representational power. Nevertheless, it struggles to capture complex spatial dependencies in multimodal settings. To bridge this gap, we propose VIMCAN, a hybrid architecture that combines the efficient sequence modeling of Mamba with the spatial reasoning of Cross-Attention to perform robust visual-inertial fusion of RGB keypoints and wearable IMU data for human pose estimation. By leveraging Mamba's dynamic parameterization for temporal modeling and Attention for…
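The efficiency argument in the abstract can be made concrete with a simplified selective state-space scan. Self-attention over T frames builds a T × T score matrix, whereas the recurrence below visits each frame once while carrying a fixed-size hidden state, so its cost grows linearly with T. This is a sequential reference sketch in NumPy, not the authors' implementation and not the hardware-aware parallel scan used by Mamba itself; all shapes, weight names, and the simple discretization are assumptions.

```python
import numpy as np

def selective_scan(x, A, w_delta, w_B, w_C):
    """Simplified selective SSM over a sequence.

    x: (T, D) inputs; A: (D, N) negative decay rates;
    w_delta: (D, D); w_B, w_C: (D, N) projection weights.
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                        # fixed-size hidden state
    y = np.empty((T, D))
    for t in range(T):
        # "Selective": step size, B, and C all depend on the current input.
        delta = np.logaddexp(0.0, x[t] @ w_delta)   # softplus, shape (D,)
        B_t = x[t] @ w_B                            # (N,)
        C_t = x[t] @ w_C                            # (N,)
        A_bar = np.exp(delta[:, None] * A)          # decay in (0, 1)
        B_bar = delta[:, None] * B_t[None, :]       # (D, N)
        h = A_bar * h + B_bar * x[t][:, None]       # state update
        y[t] = h @ C_t                              # readout, shape (D,)
    return y

# Hypothetical sizes: 120 frames, 8 channels, state dimension 4.
rng = np.random.default_rng(0)
T, D, N = 120, 8, 4
x = rng.normal(size=(T, D))
A = -np.exp(rng.normal(size=(D, N)))            # negative, so the state decays
w_delta = rng.normal(size=(D, D)) * 0.1
w_B = rng.normal(size=(D, N)) * 0.1
w_C = rng.normal(size=(D, N)) * 0.1
y = selective_scan(x, A, w_delta, w_B, w_C)
```

The loop performs T updates of a D × N state, so doubling the sequence length doubles the work, while the output at frame t depends only on frames up to t. The recurrence mixes information along time only, which is consistent with the abstract's point that a separate mechanism is needed for spatial dependencies across modalities.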

Code & Models

Repositories

Eddieyzp/VIMCAN (GitHub)
