Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery
Jiaxin Cen, Xudong Mao, Guanghui Yue, Wei Zhou, Ruomei Wang, Fan Zhou, Baoquan Zhao

TL;DR
This paper introduces a depth-guided framework for monocular video human mesh recovery that improves metric consistency and temporal stability by fusing depth cues with RGB features.
Contribution
The paper proposes a depth-guided approach built from three modules: a Depth-Guided Multi-Scale Fusion module that integrates geometric priors with RGB features, the D-MAPS estimator for scale-consistent initialization, and the MoDAR module for temporal coherence.
Findings
Outperforms existing methods on challenging benchmarks
Improves robustness against occlusion and depth ambiguities
Maintains computational efficiency
Abstract
Monocular video human mesh recovery faces fundamental challenges in maintaining metric consistency and temporal stability due to inherent depth ambiguities and scale uncertainties. While existing methods rely primarily on RGB features and temporal smoothing, they struggle with depth ordering, scale drift, and occlusion-induced instabilities. We propose a comprehensive depth-guided framework that achieves metric-aware temporal consistency through three synergistic components: (1) a Depth-Guided Multi-Scale Fusion module that adaptively integrates geometric priors with RGB features via confidence-aware gating; (2) a Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator that leverages depth-calibrated bone statistics for scale-consistent initialization; and (3) a Motion-Depth Aligned Refinement (MoDAR) module that enforces temporal coherence through cross-modal attention between motion dynamics and…
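To make the idea of confidence-aware gating concrete, here is a minimal NumPy sketch of one way depth features could be blended into RGB features. It is not the authors' implementation: the abstract does not specify the gate's form, so the function name `gated_fusion`, the learned parameters `w_gate` and `b_gate`, and the per-token confidence `depth_conf` are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(rgb_feat, depth_feat, depth_conf, w_gate, b_gate):
    """Blend depth features into RGB features with a confidence-scaled gate.

    Hypothetical shapes (assumptions, not taken from the paper):
      rgb_feat, depth_feat: (N, C) per-token features
      depth_conf:           (N,) depth confidence in [0, 1]
      w_gate, b_gate:       (2C, C) and (C,) gate parameters
    """
    # Predict a per-channel gate from both modalities.
    joint = np.concatenate([rgb_feat, depth_feat], axis=-1)  # (N, 2C)
    gate = sigmoid(joint @ w_gate + b_gate)                  # (N, C), in (0, 1)
    # Scale the gate by confidence: unreliable depth contributes less.
    gate = gate * depth_conf[:, None]
    # Convex combination of the two feature streams.
    return (1.0 - gate) * rgb_feat + gate * depth_feat
```

With this form, a token whose depth confidence is zero falls back to its RGB feature unchanged, which is one plausible way to stay robust where the depth estimate is unreliable, for example under occlusion.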
Taxonomy
Topics: Human Pose and Action Recognition · 3D Shape Modeling and Analysis · Advanced Vision and Imaging
