Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery

Jiaxin Cen; Xudong Mao; Guanghui Yue; Wei Zhou; Ruomei Wang; Fan Zhou; Baoquan Zhao

arXiv:2602.04257 · cs.CV · February 5, 2026

Open Access

TL;DR

This paper introduces a depth-guided framework for monocular video human mesh recovery that improves metric consistency and temporal stability by combining depth cues with RGB features in three modules: a Depth-Guided Multi-Scale Fusion module, the D-MAPS estimator, and the MoDAR refinement module.

Contribution

It proposes a comprehensive depth-guided approach with three modules that enhance geometric integration, scale consistency, and temporal coherence in human mesh recovery from monocular videos.

Findings

01 Outperforms existing methods on challenging benchmarks

02 Improves robustness against occlusion and depth ambiguities

03 Maintains computational efficiency

Abstract

Monocular video human mesh recovery faces fundamental challenges in maintaining metric consistency and temporal stability due to inherent depth ambiguities and scale uncertainties. While existing methods rely primarily on RGB features and temporal smoothing, they struggle with depth ordering, scale drift, and occlusion-induced instabilities. We propose a comprehensive depth-guided framework that achieves metric-aware temporal consistency through three synergistic components: a Depth-Guided Multi-Scale Fusion module that adaptively integrates geometric priors with RGB features via confidence-aware gating; a Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator that leverages depth-calibrated bone statistics for scale-consistent initialization; a Motion-Depth Aligned Refinement (MoDAR) module that enforces temporal coherence through cross-modal attention between motion dynamics and…
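To make the phrase "confidence-aware gating" concrete, here is a minimal sketch of one common way such a gate can be built. It is not the authors' implementation: the function name `gated_fusion`, the scalar gate parameters `w` and `b`, and the per-channel confidence input are all assumptions made for illustration. The idea shown is only that a confidence score in [0, 1] is mapped through a sigmoid to a gate, which then blends depth-derived and RGB-derived features.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(rgb_feat, depth_feat, depth_conf, w=4.0, b=-2.0):
    """Blend two equal-length feature vectors element-wise.

    rgb_feat, depth_feat: lists of floats (hypothetical per-channel features).
    depth_conf: list of floats in [0, 1] (hypothetical depth confidence).
    w, b: hypothetical gate parameters; a trained model would learn these.
    """
    if not (len(rgb_feat) == len(depth_feat) == len(depth_conf)):
        raise ValueError("all inputs must have the same length")
    fused = []
    for r, d, c in zip(rgb_feat, depth_feat, depth_conf):
        g = sigmoid(w * c + b)  # gate approaches 1 as depth confidence rises
        fused.append(g * d + (1.0 - g) * r)  # convex blend of the two cues
    return fused


# Usage: low confidence leans on RGB, high confidence leans on depth.
print(gated_fusion([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.5, 1.0]))
```

Because the blend is convex, each fused value stays between the corresponding RGB and depth values, so an unreliable depth estimate (for example, under occlusion) can be down-weighted without being discarded outright.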

Taxonomy

Topics: Human Pose and Action Recognition · 3D Shape Modeling and Analysis · Advanced Vision and Imaging