AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking
Ziyi Kou, Ankit Kumar, Mia Huang, Taylor Niehues, Vatsal Mehta, Ergys Ristani, Li Guan

TL;DR
AVI-HT adaptively fuses egocentric images with on-glove 6-DoF IMU signals for accurate 3D hand tracking, especially under heavy visual occlusion during hand-object interaction, using a cross-sensor attention mechanism trained on synchronized real-world vision-IMU data.
Contribution
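The paper proposes an adaptive vision-IMU fusion approach built on a cross-sensor attention mechanism that modulates the trust assigned to vision and to each IMU sensor, together with DexGloveHOI, a dataset of 100K+ synchronized vision-IMU samples with 3D pose annotations.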
Findings
AVI-HT reduces mean keypoint error by 16.1%.
Wrist-aligned variant reduces error by 24.2%.
Ablation studies show finger-specific IMU contributions.
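The percentages above are relative error reductions. This page does not state the baseline or the exact metric definition, so the following is only a minimal sketch under assumptions: "mean keypoint error" is taken to be the mean Euclidean distance per keypoint, and "wrist-aligned" is taken to mean that the wrist position is subtracted from both prediction and ground truth before the error is computed. The function names, the wrist index of 0, and the 21-keypoint layout in the usage note are hypothetical, not taken from the paper.

```python
import numpy as np


def mean_keypoint_error(pred, gt):
    """Mean Euclidean distance per keypoint.

    pred, gt: arrays of shape (num_frames, num_keypoints, 3).
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def wrist_aligned_error(pred, gt, wrist_index=0):
    """Same error after translating each hand so its wrist sits at the origin.

    wrist_index=0 is an assumed convention, not confirmed by the source.
    """
    pred_rel = pred - pred[:, wrist_index:wrist_index + 1, :]
    gt_rel = gt - gt[:, wrist_index:wrist_index + 1, :]
    return mean_keypoint_error(pred_rel, gt_rel)


def relative_reduction(baseline_err, new_err):
    """Percentage by which new_err is lower than baseline_err."""
    return 100.0 * (baseline_err - new_err) / baseline_err
```

For example, with arrays of shape (num_frames, 21, 3), a prediction that is shifted from the ground truth by a constant offset has a nonzero mean keypoint error but a wrist-aligned error of zero, which is why the two variants can report different reductions.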
Abstract
We present AVI-HT, an adaptive visual-IMU fusion approach for tracking 3D hand poses by jointly modeling the egocentric image with on-glove 6-DoF IMU signals. AVI-HT achieves significantly improved accuracy and availability, particularly in hand-object interaction (HOI) scenarios involving heavy visual occlusion. Two complementary ingredients underpin its success: (1) synchronized multi-modal training data pairing on-body vision-IMU sensor streams with ground-truth 3D hand poses from a motion-capture system, and (2) a cross-sensor deep attention mechanism that adaptively modulates the trust assigned to the vision and individual IMU sensors. To evaluate AVI-HT in real-world settings, we conduct extensive experiments on our DexGloveHOI dataset that consists of 100K+ pairwise vision-IMU samples with synchronized 3D annotated poses, in which users manipulate a variety of objects during…
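The idea of adaptively modulating trust across sensors can be illustrated with a single attention-style weighting step. This is a generic sketch, not the architecture from the paper: the token shapes, the mean-pooled query, the projection matrices `w_q`, `w_k`, `w_v`, and the function name `fuse_sensor_tokens` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def fuse_sensor_tokens(vision_token, imu_tokens, w_q, w_k, w_v):
    """Fuse one vision feature with per-IMU features via attention weights.

    vision_token: shape (dim,)            hypothetical image feature
    imu_tokens:   shape (num_imus, dim)   hypothetical per-IMU features
    w_q, w_k, w_v: shape (dim, dim)       learned projections (assumed)

    Returns (fused, weights). weights has shape (num_imus + 1,), with
    index 0 for vision, and can be read as the trust given to each sensor.
    """
    tokens = np.vstack([vision_token[None, :], imu_tokens])
    query = tokens.mean(axis=0) @ w_q           # pooled context as the query
    keys = tokens @ w_k
    values = tokens @ w_v
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)                   # non-negative, sums to 1
    fused = weights @ values                    # trust-weighted combination
    return fused, weights
```

In a trained model of this kind, one would expect the weight on the vision token to drop when the hand is occluded and the IMU weights to rise, but that behavior depends on training and is not demonstrated by this sketch.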
