TL;DR
This paper introduces MoCapAnything V2, an end-to-end motion capture framework that jointly learns pose and rotation estimation from monocular video, improving accuracy and efficiency over prior factorized methods.
Contribution
It presents the first fully end-to-end trainable system for arbitrary-skeleton motion capture. The system uses a reference pose to improve rotation prediction and estimates joint positions directly from video.
Findings
Reduces rotation error from ~17° to ~10°, and to 6.54° on unseen skeletons.
Achieves ~20x faster inference than mesh-based pipelines.
Improves robustness and efficiency by predicting joint positions directly from video.
Abstract
Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and…
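The twist ambiguity described in the abstract can be illustrated with a small, self-contained sketch. It is not the paper's method. It only shows why joint positions alone cannot pin down a rotation: twisting a bone about its own axis leaves the child joint where it was, while the joint's local frame changes. The bone layout, the 40° twist, and all names (`rotate`, `bone_axis`, `child_offset`, `local_x`) are hypothetical, chosen for illustration.

```python
import math

def rotate(v, axis, angle):
    """Rotate vector v about a unit-length axis by angle (radians), using Rodrigues' formula."""
    ax, ay, az = axis
    vx, vy, vz = v
    c, s = math.cos(angle), math.sin(angle)
    dot = ax * vx + ay * vy + az * vz
    # cross = axis x v
    cross = (ay * vz - az * vy, az * vx - ax * vz, ax * vy - ay * vx)
    return tuple(
        vi * c + ci * s + ai * dot * (1.0 - c)
        for vi, ci, ai in zip(v, cross, axis)
    )

# Hypothetical bone: parent joint at the origin, child joint one unit along +Y.
bone_axis = (0.0, 1.0, 0.0)
child_offset = (0.0, 1.0, 0.0)
# A direction perpendicular to the bone, standing in for the joint's local X axis.
local_x = (1.0, 0.0, 0.0)

# Apply a twist about the bone axis (assumed value, for illustration only).
twist = math.radians(40.0)
child_after = rotate(child_offset, bone_axis, twist)
local_x_after = rotate(local_x, bone_axis, twist)

# How far the child joint moved, and how far the local frame axis moved.
position_change = math.dist(child_offset, child_after)
frame_change = math.dist(local_x, local_x_after)

print(f"child joint displacement: {position_change:.6f}")
print(f"local X axis displacement: {frame_change:.6f}")
```

The child joint displacement is zero up to floating-point error, while the local axis moves by a clearly nonzero amount. Any position-only observation therefore cannot distinguish the twisted joint from the untwisted one, which is the degree of freedom an analytical IK stage has to resolve by convention rather than from the data.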
