TL;DR
AGILE introduces a novel framework for reconstructing hand-object interactions from monocular videos by using agentic generation and robust tracking, overcoming occlusion and initialization challenges.
Contribution
AGILE replaces traditional reconstruction with agentic generation guided by a vision-language model, yielding robust, complete, and physically plausible interaction reconstructions without relying on fragile Structure-from-Motion (SfM) initialization.
Findings
Outperforms baselines in geometric accuracy.
Demonstrates robustness on challenging in-the-wild sequences.
Produces simulation-ready assets validated via real-to-sim retargeting.
Abstract
Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose…
