TL;DR
RHINO is a three-step framework that reconstructs 3D humans, novel objects, and scenes from monocular videos, leveraging foundation models, motion estimation, and neural fields for accurate, physically plausible reconstructions.
Contribution
The paper introduces RHINO, a novel method that jointly reconstructs humans, unseen objects, and scenes from monocular videos, addressing occlusion and motion entanglement challenges.
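The motion entanglement mentioned above can be illustrated with a minimal sketch. This is not RHINO's method, only a toy 2D rigid-pose example under assumed conventions: poses are `(theta, x, y)` tuples, and the helper names `compose`, `inverse`, and `translation_shift` are hypothetical. It shows a static object appearing to move in the camera frame when only the camera moves, and that apparent motion vanishing once the camera pose is known and composed back in.

```python
import math

def compose(a, b):
    """Compose two 2D rigid poses (theta, x, y): apply b in the frame of a."""
    ta, xa, ya = a
    tb, xb, yb = b
    c, s = math.cos(ta), math.sin(ta)
    return (ta + tb, xa + c * xb - s * yb, ya + s * xb + c * yb)

def inverse(a):
    """Invert a 2D rigid pose: rotation R^T, translation -R^T t."""
    t, x, y = a
    c, s = math.cos(t), math.sin(t)
    return (-t, -(c * x + s * y), s * x - c * y)

def translation_shift(p, q):
    """Euclidean distance between the translation parts of two poses."""
    return math.hypot(q[1] - p[1], q[2] - p[2])

# Assumed toy data: the object is static in the world; only the camera moves.
world_obj = (0.0, 1.0, 2.0)
world_cams = [(0.0, 0.0, 0.0), (0.1, 0.2, 0.0)]

# What a camera-frame estimator would observe: object pose relative to each camera.
cam_obj = [compose(inverse(cam), world_obj) for cam in world_cams]
apparent_shift = translation_shift(cam_obj[0], cam_obj[1])

# With known camera poses, the object pose can be lifted back to the world frame.
recovered = [compose(cam, obs) for cam, obs in zip(world_cams, cam_obj)]
world_shift = translation_shift(recovered[0], recovered[1])

print(f"apparent shift in camera frame: {apparent_shift:.4f}")
print(f"shift after removing camera motion: {world_shift:.2e}")
```

In a real monocular video neither the camera poses nor the object poses are given, which is why the two motions have to be estimated jointly in a common world frame.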
Findings
RHINO outperforms state-of-the-art methods on novel-view synthesis.
The framework achieves accurate 4D reconstructions with physically plausible shapes.
Each stage of RHINO significantly improves reconstruction quality.
Abstract
Reconstructing people, objects, and their interactions in 3D is a long-standing goal for intelligent systems. Often the input is RGB video from a moving camera, which makes the task ill-posed: depth is ambiguous, humans and objects occlude each other, and camera and object motion entangle to create apparent motion. Most prior work addresses humans or objects in isolation, ignoring their interplay, or assumes known 3D shapes or cameras, which is impractical for real-world applications. We develop RHINO (Reconstructing Human Interactions with Novel Objects), a three-step framework that recovers a human, a novel (unseen) manipulated object, and the static scene in 3D, in a common world frame, from a monocular RGB video. First, we leverage 3D-aware foundation models to obtain cues that stabilize Structure-from-Motion (SfM) even for low-texture regions; this yields a coarse shape and apparent motion of a…
