Learning Human Motion from Monocular Videos via Cross-Modal Manifold Alignment
Shuaiying Hou, Hongyu Tao, Junheng Fang, Changqing Zou, Hujun Bao, Weiwei Xu

TL;DR
This paper introduces VTM, a method for reconstructing 3D human motion from monocular videos by aligning 3D motion and 2D inputs (videos and 2D keypoints) in a shared latent space, which helps it cope with the inherent ambiguity of 2D inputs and with variation across human skeletons.
Contribution
The paper proposes a cross-modal latent space alignment approach that models upper and lower body motions separately and uses a scale-invariant skeleton for robust 3D motion reconstruction.
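One way to picture the scale-invariant skeleton idea is to keep each bone's direction and replace its length with a fixed canonical value, so that subjects of different sizes map to the same pose. The sketch below is only an illustration under that assumption; the function name `retarget_to_virtual_skeleton`, the `parents` layout, and the canonical `bone_lengths` are hypothetical and are not taken from the paper.

```python
import numpy as np

def retarget_to_virtual_skeleton(joints, parents, bone_lengths):
    """Rebuild a pose on a skeleton with fixed bone lengths (hypothetical helper).

    joints:       (J, 3) joint positions of the observed skeleton
    parents:      parent index per joint, -1 for the root; parents[j] < j
    bone_lengths: canonical length of the bone ending at each joint
    """
    joints = np.asarray(joints, dtype=float)
    out = np.zeros_like(joints)
    for j, p in enumerate(parents):
        if p < 0:
            continue  # the root is placed at the origin
        offset = joints[j] - joints[p]
        norm = np.linalg.norm(offset)
        # Keep the bone direction, discard the subject-specific length.
        direction = offset / norm if norm > 1e-8 else np.zeros(3)
        out[j] = out[p] + bone_lengths[j] * direction
    return out

# A three-joint chain and a scaled, shifted copy of the same pose.
parents = [-1, 0, 1]
bone_lengths = [0.0, 1.0, 1.0]
pose = np.array([[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 2.0, 3.0]])
small = retarget_to_virtual_skeleton(pose, parents, bone_lengths)
large = retarget_to_virtual_skeleton(5.0 * pose + 1.0, parents, bone_lengths)
```

In this toy case `small` and `large` coincide, which is the property a scale-invariant representation is meant to provide: the motion prior sees the pose, not the subject's proportions.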
Findings
Achieves state-of-the-art results on the AIST++ dataset.
Generalizes well to unseen view angles.
Performs effectively on in-the-wild videos.
Abstract
Learning 3D human motion from 2D inputs is a fundamental task in the realms of computer vision and computer graphics. Many previous methods grapple with this inherently ambiguous task by introducing motion priors into the learning process. However, these approaches face difficulties in defining the complete configurations of such priors or training a robust model. In this paper, we present the Video-to-Motion Generator (VTM), which leverages motion priors through cross-modal latent feature space alignment between 3D human motion and 2D inputs, namely videos and 2D keypoints. To reduce the complexity of modeling motion priors, we model the motion data separately for the upper and lower body parts. Additionally, we align the motion data with a scale-invariant virtual skeleton to mitigate the interference of human skeleton variations to the motion priors. Evaluated on AIST++, the VTM…
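As a rough illustration of cross-modal latent alignment, the sketch below computes a cosine-distance loss between paired latent codes, one from a motion encoder and one from a 2D-input encoder. The abstract does not state which alignment objective VTM uses, so the cosine form, the function name `alignment_loss`, and the array shapes are assumptions made for this example only.

```python
import numpy as np

def alignment_loss(z_motion, z_visual, eps=1e-8):
    """Mean cosine distance between paired latents (illustrative objective).

    z_motion: (N, D) latents from a hypothetical 3D-motion encoder
    z_visual: (N, D) latents from a hypothetical video / 2D-keypoint encoder
    """
    z_motion = np.asarray(z_motion, dtype=float)
    z_visual = np.asarray(z_visual, dtype=float)
    # Normalize each latent so only its direction in the shared space matters.
    zm = z_motion / (np.linalg.norm(z_motion, axis=-1, keepdims=True) + eps)
    zv = z_visual / (np.linalg.norm(z_visual, axis=-1, keepdims=True) + eps)
    cosine = np.sum(zm * zv, axis=-1)
    return float(np.mean(1.0 - cosine))

z = np.array([[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]])
loss_same = alignment_loss(z, z)       # paired latents agree
loss_opposite = alignment_loss(z, -z)  # paired latents point in opposite directions
```

The loss is near zero when the two encoders agree and grows as their latents diverge, so minimizing it pulls the 2D-input latents toward the motion latents that carry the prior.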
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Human Motion and Animation
Methods: ALIGN
