TL;DR
MoCapAnything is a unified framework that enables 3D motion capture from monocular videos for arbitrary skeletons, using a reference-guided, factorized approach with a new dataset and cross-species retargeting capabilities.
Contribution
MoCapAnything introduces a category-agnostic motion capture system that reconstructs animations for any rigged asset from monocular video, rather than being tied to a single species or skeleton template.
Findings
Achieves high-quality skeletal animations across diverse rigs.
Demonstrates effective cross-species retargeting in in-the-wild videos.
Outperforms existing methods on in-domain benchmarks.
Abstract
Motion capture now underpins content creation far beyond digital humans, yet most existing pipelines remain species- or template-specific. We formalize this gap as Category-Agnostic Motion Capture (CAMoCap): given a monocular video and an arbitrary rigged 3D asset as a prompt, the goal is to reconstruct a rotation-based animation such as BVH that directly drives the specific asset. We present MoCapAnything, a reference-guided, factorized framework that first predicts 3D joint trajectories and then recovers asset-specific rotations via constraint-aware inverse kinematics. The system contains three learnable modules and a lightweight IK stage: (1) a Reference Prompt Encoder that extracts per-joint queries from the asset's skeleton, mesh, and rendered images; (2) a Video Feature Extractor that computes dense visual descriptors and reconstructs a coarse 4D deforming mesh to bridge the gap…
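The factorized idea in the abstract (predict 3D joint positions first, then recover rotations) can be illustrated with a toy building block. The sketch below is not the paper's constraint-aware IK stage; it only computes, for one bone, the smallest rotation that turns its rest-pose direction into the direction implied by predicted joint positions. All function names and the `parents` layout are hypothetical, and the bone's twist about its own axis is left unresolved, which is one reason a real IK stage needs additional constraints.

```python
import math

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return [c / n for c in v]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def axis_angle_to_matrix(axis, angle):
    # Rodrigues' rotation formula for a unit axis.
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [[c + t * x * x, t * x * y - s * z, t * x * z + s * y],
            [t * x * y + s * z, c + t * y * y, t * y * z - s * x],
            [t * x * z - s * y, t * y * z + s * x, c + t * z * z]]

def bone_rotation(rest_dir, target_dir):
    # Minimal (shortest-arc) rotation taking rest_dir onto target_dir.
    a, b = normalize(rest_dir), normalize(target_dir)
    axis = cross(a, b)
    sin_t = math.sqrt(dot(axis, axis))
    cos_t = dot(a, b)
    if sin_t < 1e-9:
        if cos_t > 0.0:
            return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
        # Opposite directions: rotate 180 degrees about any perpendicular axis.
        helper = [1.0, 0.0, 0.0] if abs(a[0]) < 0.9 else [0.0, 1.0, 0.0]
        return axis_angle_to_matrix(normalize(cross(a, helper)), math.pi)
    return axis_angle_to_matrix([c / sin_t for c in axis],
                                math.atan2(sin_t, cos_t))

def apply(rotation, v):
    return [dot(row, v) for row in rotation]

def frame_bone_rotations(rest_joints, frame_joints, parents):
    # World-space rotation per non-root bone for one frame.
    # parents[i] is the parent index of joint i, or -1 for the root.
    rotations = {}
    for j, p in enumerate(parents):
        if p < 0:
            continue
        rest = [rest_joints[j][k] - rest_joints[p][k] for k in range(3)]
        cur = [frame_joints[j][k] - frame_joints[p][k] for k in range(3)]
        rotations[j] = bone_rotation(rest, cur)
    return rotations
```

For a two-joint chain whose bone points along +Y at rest and along +X in the current frame, `frame_bone_rotations([[0, 0, 0], [0, 1, 0]], [[0, 0, 0], [1, 0, 0]], [-1, 0])` returns a single rotation that maps the rest direction onto the observed one. Converting such world-space rotations into the local, per-joint rotations stored in a BVH file would additionally require composing them along the hierarchy.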
