Self-supervised Learning of Motion Capture
Hsiao-Yu Fish Tung, Hsiao-Wei Tung, Ersin Yumer, Katerina Fragkiadaki

TL;DR
This paper introduces a learning-based model for motion capture from a single RGB camera. It combines supervised training on synthetic data with self-supervision via differentiable rendering, which addresses the local-minima problems of traditional optimization-based methods.
Contribution
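The paper proposes a neural network that predicts 3D human shape and pose from monocular video. The network is trained with strong supervision from synthetic data plus self-supervision, which reduces the manual effort (such as manual initialization) that optimization-based methods require and improves accuracy.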
Findings
The model outperforms traditional optimization-based methods.
With more training experience, it converges to low-error solutions.
It adapts to test data through self-supervision.
Abstract
Current state-of-the-art solutions for motion capture from a single camera are optimization-driven: they optimize the parameters of a 3D human model so that its re-projection matches measurements in the video (e.g. person segmentation, optical flow, keypoint detections). Optimization models are susceptible to local minima. This has been the bottleneck that forced the use of clean green-screen-like backgrounds at capture time, manual initialization, or a switch to multiple cameras as the input source. In this work, we propose a learning-based motion capture model for single-camera input. Instead of optimizing mesh and skeleton parameters directly, our model optimizes neural network weights that predict 3D shape and skeleton configurations given a monocular RGB video. Our model is trained using a combination of strong supervision from synthetic data, and self-supervision from…
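The central idea in the abstract, optimizing network weights against a re-projection loss rather than optimizing pose parameters directly, can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration and not the authors' model: the "skeleton" is a single unit-length limb with one joint angle, the "network" is a linear map from a hypothetical per-frame feature to that angle, the camera is a pinhole with an assumed focal length and depth, the 2D keypoints are synthetic, and central finite differences stand in for automatic differentiation through a differentiable renderer.

```python
import math

FOCAL = 1.0  # assumed focal length (hypothetical value)
DEPTH = 2.0  # assumed fixed distance of the limb from the camera (hypothetical value)

def project(theta):
    """Pinhole projection of the tip of a unit-length limb rotated by theta."""
    x, y, z = math.cos(theta), math.sin(theta), DEPTH
    return (FOCAL * x / z, FOCAL * y / z)

def predict(weights, feature):
    """Toy 'network': a linear map from a per-frame feature to a joint angle."""
    w, b = weights
    return w * feature + b

def reprojection_loss(weights, features, keypoints):
    """Mean squared distance between projected and detected 2D keypoints."""
    total = 0.0
    for f, (u, v) in zip(features, keypoints):
        pu, pv = project(predict(weights, f))
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total / len(features)

def numerical_grad(weights, features, keypoints, eps=1e-5):
    """Central finite differences, standing in for automatic differentiation."""
    grad = []
    for i in range(len(weights)):
        hi = list(weights)
        lo = list(weights)
        hi[i] += eps
        lo[i] -= eps
        diff = (reprojection_loss(hi, features, keypoints)
                - reprojection_loss(lo, features, keypoints))
        grad.append(diff / (2 * eps))
    return grad

def train(features, keypoints, steps=300, lr=1.0):
    """Gradient descent on the shared weights, not on per-frame angles."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        g = numerical_grad(weights, features, keypoints)
        weights = [w - lr * gw for w, gw in zip(weights, g)]
    return weights

# Synthetic "detections": keypoints generated from hidden angles 0.8 * f + 0.2.
features = [-1.0, -0.5, 0.0, 0.5, 1.0]
keypoints = [project(0.8 * f + 0.2) for f in features]

initial_loss = reprojection_loss([0.0, 0.0], features, keypoints)
weights = train(features, keypoints)
final_loss = reprojection_loss(weights, features, keypoints)
print(f"loss {initial_loss:.4f} -> {final_loss:.2e}, weights {weights}")
```

No 3D labels are used in this sketch: the only training signal is agreement between the projected limb tip and the 2D keypoints, which is the sense in which such a loss is self-supervised. Because the weights are shared across frames, what is learned on some frames carries over to others, whereas a per-frame optimizer has to start over for each frame.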
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Human Motion and Animation
