TL;DR
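ArtHOI is an optimization-based framework that integrates and refines priors from multiple foundation models to reconstruct 4D hand-articulated-object interactions from a single monocular RGB video, without the object pre-scans or multi-view footage that earlier articulated-object methods generally require.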
Contribution
It introduces methods, including Adaptive Sampling Refinement (ASR) and MLLM-guided alignment, that correct the inaccuracies and physical implausibility of foundation-model priors, improving accuracy and realism in monocular 4D reconstruction of articulated objects.
Findings
Robust reconstruction across diverse objects and interactions.
Effective optimization of object scale and pose from monocular videos.
Validated on new datasets with extensive experiments.
Abstract
Existing hand-object interaction (HOI) methods are largely limited to rigid objects, while 4D reconstruction methods for articulated objects generally require pre-scanning the object or even multi-view videos. It remains an unexplored but significant challenge to reconstruct 4D human-articulated-object interactions from a single monocular RGB video. Fortunately, recent advancements in foundation models present a new opportunity to address this highly ill-posed problem. To this end, we introduce ArtHOI, an optimization-based framework that integrates and refines priors from multiple foundation models. Our key contribution is a suite of novel methodologies designed to resolve the inherent inaccuracies and physical unreality of these priors. In particular, we introduce an Adaptive Sampling Refinement (ASR) method to optimize the object's metric scale and pose for grounding its normalized mesh…
