TL;DR
DeVI (Dexterous Video Imitation) is a framework that uses text-conditioned synthetic videos as imitation targets for physically plausible dexterous human-object interaction in physics-based agent control, working around the limited physical fidelity and purely 2D nature of generated videos.
Contribution
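DeVI introduces a hybrid tracking reward that incorporates 3D human tracking to compensate for imprecise 2D cues in generated videos. This lets it learn dexterous manipulation control from synthetic videos alone, without 3D demonstrations, and generalize zero-shot to unseen target objects.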
Findings
DeVI outperforms existing imitation methods in dexterous hand-object interactions.
It effectively generalizes to unseen objects and interaction types.
DeVI demonstrates success in multi-object scenes and diverse, text-driven actions.
Abstract
Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. While the rich interaction knowledge embedded in these synthetic videos holds strong potential for motion planning in dexterous robotic manipulation, their limited physical fidelity and purely 2D nature make them difficult to use directly as imitation targets in physics-based character control. We present DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous agent control for interacting with unseen target objects. To overcome the imprecision of generative 2D cues, we introduce a hybrid tracking reward that integrates 3D human tracking…
