MoBind: Motion Binding for Fine-Grained IMU-Video Pose Alignment
Duc Duy Nguyen, Tat-Jun Chin, Minh Hoai

TL;DR
MoBind is a hierarchical contrastive learning framework that aligns IMU signals with skeletal motion extracted from video, supporting cross-modal retrieval, temporal synchronization, and action recognition.
Contribution
The paper presents a multi-level contrastive approach to fine-grained IMU-video alignment. It decomposes full-body motion into local body-part trajectories and pairs each with its corresponding IMU, with the aim of improving semantic and temporal accuracy.
Findings
Outperforms baselines on multiple datasets
Achieves robust fine-grained temporal alignment
Preserves semantic consistency across modalities
Abstract
We aim to learn a joint representation between inertial measurement unit (IMU) signals and 2D pose sequences extracted from video, enabling accurate cross-modal retrieval, temporal synchronization, subject and body-part localization, and action recognition. To this end, we introduce MoBind, a hierarchical contrastive learning framework designed to address three challenges: (1) filtering out irrelevant visual background, (2) modeling structured multi-sensor IMU configurations, and (3) achieving fine-grained, sub-second temporal alignment. To isolate motion-relevant cues, MoBind aligns IMU signals with skeletal motion sequences rather than raw pixels. We further decompose full-body motion into local body-part trajectories, pairing each with its corresponding IMU to enable semantically grounded multi-sensor alignment. To capture detailed temporal correspondence, MoBind employs a…
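The abstract does not spell out the loss, so the following is only a minimal sketch of what a part-level contrastive objective of this kind could look like, assuming a standard symmetric InfoNCE loss applied to each body part and averaged. The function names, the temperature value, and the dictionary layout (part name mapped to a batch of embeddings) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(imu_emb, pose_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (IMU, pose) embeddings.

    imu_emb, pose_emb: arrays of shape (batch, dim); row i of each is a
    matched pair, and every other row in the batch acts as a negative.
    The temperature is an assumed default, not a value from the paper.
    """
    imu = l2_normalize(imu_emb)
    pose = l2_normalize(pose_emb)
    logits = imu @ pose.T / temperature          # (batch, batch) similarities
    idx = np.arange(logits.shape[0])
    loss_imu_to_pose = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_pose_to_imu = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_imu_to_pose + loss_pose_to_imu)


def per_part_loss(imu_parts, pose_parts, temperature=0.07):
    """Average the symmetric loss over body parts (hypothetical layout).

    imu_parts, pose_parts: dicts mapping a part name (e.g. "left_wrist")
    to a (batch, dim) embedding array for that sensor / body-part trajectory.
    """
    losses = [
        symmetric_info_nce(imu_parts[name], pose_parts[name], temperature)
        for name in imu_parts
    ]
    return float(np.mean(losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    imu = {"left_wrist": rng.normal(size=(8, 16)),
           "right_ankle": rng.normal(size=(8, 16))}
    # Matched pose embeddings: the IMU embeddings plus a little noise.
    pose = {k: v + 0.05 * rng.normal(size=v.shape) for k, v in imu.items()}
    print("aligned loss:", per_part_loss(imu, pose))
```

Pairing each sensor only with the trajectory of the body part it is attached to keeps the negatives within a part, which is one simple way to realize the "semantically grounded" pairing the abstract describes. A full-body term and a finer temporal term would sit alongside this in a hierarchical setup.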
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Balance, Gait, and Falls Prevention
