Masked Modeling for Human Motion Recovery Under Occlusions

Zhiyin Qian; Siwei Zhang; Bharat Lal Bhatnagar; Federica Bogo; Siyu Tang

arXiv:2601.16079 · cs.CV · January 26, 2026

Open Access

TL;DR

This paper introduces MoRo, a generative masked modeling framework that robustly reconstructs human motion from monocular videos under occlusions, combining multi-modal priors for accurate, real-time performance.

Contribution

MoRo is the first end-to-end generative approach leveraging masked modeling and multi-modal priors for occlusion-robust human motion recovery from RGB videos.

Findings

01

MoRo outperforms state-of-the-art methods in accuracy and realism under occlusions.

02

MoRo achieves real-time inference at 70 FPS on a single GPU.

03

MoRo performs comparably to existing methods in non-occluded scenarios.

Abstract

Human motion reconstruction from monocular videos is a fundamental challenge in computer vision, with broad applications in AR/VR, robotics, and digital content creation, but remains challenging under frequent occlusions in real-world settings. Existing regression-based methods are efficient but fragile to missing observations, while optimization- and diffusion-based approaches improve robustness at the cost of slow inference speed and heavy preprocessing steps. To address these limitations, we leverage recent advances in generative masked modeling and present MoRo: Masked Modeling for human motion Recovery under Occlusions. MoRo is an occlusion-robust, end-to-end generative framework that formulates motion reconstruction as a video-conditioned task, and efficiently recovers human motion in a consistent global coordinate system from RGB videos. By masked modeling, MoRo naturally handles…
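As a rough illustration of the general idea behind generative masked modeling (not the MoRo architecture, whose details are not given in this listing), the sketch below treats a motion sequence as discrete tokens, replaces tokens from occluded frames with a mask, and fills them in over a few rounds, committing the most confident proposals first. Everything here is an assumption made for illustration: the `MASK` sentinel, the function names, and especially `toy_predictor`, which stands in for a learned video-conditioned model by simply copying the nearest known token.

```python
MASK = -1  # hypothetical sentinel marking a masked (occluded) token


def mask_occluded(tokens, visible):
    """Replace tokens of occluded frames with MASK."""
    return [t if v else MASK for t, v in zip(tokens, visible)]


def toy_predictor(tokens):
    """Stand-in for a learned model (assumption, not the paper's network).

    For every masked slot, propose the nearest known token together with a
    confidence that decays with the distance to that token.
    """
    known = [i for i, t in enumerate(tokens) if t != MASK]
    proposals = {}
    for i, t in enumerate(tokens):
        if t == MASK and known:
            j = min(known, key=lambda k: abs(k - i))
            proposals[i] = (tokens[j], 1.0 / (1 + abs(j - i)))
    return proposals


def iterative_unmask(tokens, steps=3):
    """Fill masked slots over several rounds, most confident first."""
    tokens = list(tokens)
    for step in range(steps):
        proposals = toy_predictor(tokens)
        if not proposals:
            break
        remaining = steps - step
        # Commit an equal share of the remaining masked slots each round
        # (ceiling division), so the last round commits everything left.
        n_commit = max(1, -(-len(proposals) // remaining))
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:n_commit]:
            tokens[i] = proposals[i][0]
    return tokens


if __name__ == "__main__":
    motion_tokens = [5, 7, 7, 9, 2, 2]
    visible = [True, False, False, True, True, False]
    masked = mask_occluded(motion_tokens, visible)
    print(masked)
    print(iterative_unmask(masked))
```

In a real system the predictor would be a trained network conditioned on image evidence, and the tokens would come from a learned motion codebook; the loop structure is the only part this sketch is meant to convey.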

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis