Zero-shot Reconstruction of In-Scene Object Manipulation from Video

Dixuan Lin; Tianyou Wang; Zhuoyang Pan; Yufu Wang; Lingjie Liu; Kostas Daniilidis

arXiv:2512.19684·cs.CV·December 23, 2025

TL;DR

This paper introduces a novel system that reconstructs in-scene object manipulation from monocular RGB videos, leveraging foundation models and a two-stage optimization to recover detailed hand-object interactions.

Contribution

The paper presents the first system for reconstructing in-scene object manipulation from monocular RGB video. It combines data-driven initialization from foundation models with a two-stage optimization to produce physically plausible, scene-consistent hand-object motion.

Findings

01 Achieves detailed hand-object motion reconstruction from monocular videos.

02 Outperforms existing methods by incorporating scene context.

03 Provides a complete pipeline from initialization to optimization.

Abstract

We build the first system to address the problem of reconstructing in-scene object manipulation from a monocular RGB video. The problem is challenging due to ill-posed scene reconstruction, ambiguous hand-object depth, and the need for physically plausible interactions. Existing methods operate in hand-centric coordinates and ignore the scene, which hinders metric accuracy and practical use. Our method first uses data-driven foundation models to initialize the core components: the object mesh and poses, the scene point cloud, and the hand poses. We then apply a two-stage optimization that recovers a complete hand-object motion, from grasping to interaction, that remains consistent with the scene information observed in the input video.
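The "ambiguous hand-object depth" mentioned in the abstract can be illustrated with a generic pinhole camera. The sketch below is not the paper's method; the intrinsics (`focal`, `cx`, `cy`) and the 3D point are hypothetical values chosen only to show that a point moved along its camera ray projects to the same pixel, so a single RGB frame cannot fix its metric depth.

```python
def project(point, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-frame 3D point (x, y, z) to a pixel (u, v).

    The intrinsics are hypothetical defaults, not values from the paper.
    """
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def scale_along_ray(point, s):
    """Move a point along its camera ray by scaling all coordinates by s."""
    return tuple(s * c for c in point)


# Hypothetical hand joint 0.5 m in front of the camera.
hand_near = (0.10, -0.05, 0.50)
# The same joint pushed to 1.0 m along the same viewing ray.
hand_far = scale_along_ray(hand_near, 2.0)

pix_near = project(hand_near)
pix_far = project(hand_far)

# Both points land on the same pixel (up to floating-point error),
# even though their depths differ by a factor of two.
same_pixel = all(abs(a - b) < 1e-9 for a, b in zip(pix_near, pix_far))
print(pix_near, pix_far, same_pixel)
```

This ambiguity is one reason additional constraints, such as the scene information the abstract refers to, are plausibly needed to place the hand and object at metric scale.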

Taxonomy

Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Hand Gesture Recognition Systems