Imagine2Real: Towards Zero-shot Humanoid-Object Interaction via Video Generative Priors

Jiahe Chen; ZiRui Wang; Feiyu Jia; Xiao Chen; Xiaojie Niu; Weishuai Zeng; Tianfan Xue; Xiaowei Zhou; Jiangmiao Pang; Jingbo Wang

arXiv:2605.22272·cs.RO·May 22, 2026

TL;DR

Imagine2Real is a zero-shot humanoid-object interaction framework that leverages video generative priors and sparse keypoint tracking to enable flexible, geometry-free, and physically deployable interactions.

Contribution

The paper introduces a zero-shot HOI method that addresses representation misalignment and retargeting complexity without relying on geometric priors such as explicit CAD models.

Findings

01. Enables zero-shot physical deployment of humanoid-object interactions.

02. Uses sparse keypoints and behavior models to maintain natural gaits.

03. Achieves robust behaviors with simple training rewards.

Abstract

Whole-body Humanoid-Object Interaction (HOI) is bottlenecked by the scarcity of high-fidelity 3D data. While video generative priors offer a promising alternative, existing methods suffer from Representation Misalignment due to their reliance on geometric priors (e.g., explicit CAD models), and Retargeting Complexity arising from intensive morphing and morphological mismatch. We propose Imagine2Real, a zero-shot HOI framework for flexible, geometry-free interaction. To resolve misalignment, we formulate robot and object motions as unified 4D point trajectories. To overcome retargeting complexity, our Keypoints Tracker tracks only sparse critical points (base, hands, and object), entirely bypassing the error-amplifying retargeting process. To maintain natural gaits despite these sparse signals, we utilize the latent space of a Behavior Foundation Model (BFM) as the…
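The idea of tracking sparse keypoints along a unified point trajectory can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the reward form, so the exponential tracking reward, the `sigma` scale, the keypoint names (with "hands" split into left and right), and all function names below are assumptions chosen for illustration. A "4D point trajectory" is read here as 3D keypoint positions indexed by time.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
# One frame maps a keypoint name to its 3D position.
# A trajectory is a time-ordered list of frames (3D points + time = "4D").
Frame = Dict[str, Point]

# Hypothetical keypoint names; the abstract only says "base, hands, and object".
KEYPOINTS = ("base", "left_hand", "right_hand", "object")


def frame_error(reference: Frame, actual: Frame) -> float:
    """Mean squared Euclidean distance over the tracked keypoints."""
    total = 0.0
    for name in KEYPOINTS:
        total += sum((r - a) ** 2 for r, a in zip(reference[name], actual[name]))
    return total / len(KEYPOINTS)


def tracking_reward(reference: Frame, actual: Frame, sigma: float = 0.25) -> float:
    """Assumed exponential reward in (0, 1]; 1.0 means perfect tracking."""
    return math.exp(-frame_error(reference, actual) / (sigma ** 2))


def trajectory_reward(reference: List[Frame], actual: List[Frame]) -> float:
    """Average per-frame reward over two trajectories of equal length."""
    if not reference or len(reference) != len(actual):
        raise ValueError("trajectories must be non-empty and of equal length")
    rewards = [tracking_reward(r, a) for r, a in zip(reference, actual)]
    return sum(rewards) / len(rewards)
```

Because the robot and the object are both just named points in the same frame, no object mesh or full-body retargeting appears anywhere in the reward; only the four tracked positions are compared.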
