Masked Modeling for Human Motion Recovery Under Occlusions
Zhiyin Qian, Siwei Zhang, Bharat Lal Bhatnagar, Federica Bogo, Siyu Tang

TL;DR
This paper introduces MoRo, a generative masked modeling framework that robustly reconstructs human motion from monocular videos under occlusions, combining multi-modal priors for accurate, real-time performance.
Contribution
MoRo is the first end-to-end generative approach leveraging masked modeling and multi-modal priors for occlusion-robust human motion recovery from RGB videos.
Findings
MoRo outperforms state-of-the-art methods in accuracy and realism under occlusions.
MoRo achieves real-time inference at 70 FPS on a single GPU.
MoRo performs comparably to existing methods in non-occluded scenarios.
Abstract
Human motion reconstruction from monocular videos is a fundamental problem in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but it remains challenging under the frequent occlusions of real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference and heavy preprocessing. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task and efficiently recovers human motion in a consistent global coordinate system from RGB videos. By masked modeling, MoRo naturally handles…
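To make the masked-modeling idea in the abstract concrete, here is a minimal sketch of the generic technique, not MoRo's actual implementation. It assumes motion is represented as a sequence of discrete tokens, hides a random subset behind a mask sentinel, and fills the hidden positions from the visible context. The names `MASK_ID`, `mask_tokens`, `fill_masked`, and `nearest_visible` are hypothetical, and the nearest-neighbour predictor is only a stand-in for a learned, video-conditioned model.

```python
import random

MASK_ID = -1  # hypothetical sentinel marking a hidden (e.g. occluded) motion token


def mask_tokens(tokens, mask_ratio, rng):
    """Hide a random subset of tokens; return the masked copy and the hidden indices."""
    n_mask = round(len(tokens) * mask_ratio)
    hidden = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for i in hidden:
        masked[i] = MASK_ID
    return masked, hidden


def fill_masked(masked, predict):
    """Replace every masked position with a prediction made from the visible context."""
    out = list(masked)
    for i, token in enumerate(masked):
        if token == MASK_ID:
            out[i] = predict(masked, i)
    return out


def nearest_visible(masked, i):
    """Toy predictor (assumption): copy the closest visible token.

    A real system would use a trained network here instead.
    """
    for dist in range(1, len(masked)):
        for j in (i - dist, i + dist):
            if 0 <= j < len(masked) and masked[j] != MASK_ID:
                return masked[j]
    raise ValueError("no visible tokens to condition on")


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [3, 3, 3, 7, 7, 7]  # made-up token ids for illustration
    masked, hidden = mask_tokens(tokens, mask_ratio=0.5, rng=rng)
    print(masked, hidden)
    print(fill_masked(masked, nearest_visible))
```

In this framing, frames where the person is occluded can be treated the same way as masked positions, so the same fill-in step covers both training-time masking and missing observations at inference time.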
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis
