TC-IDM: Grounding Video Generation for Executable Zero-shot Robot Motion
Weishi Mi, Yong Bao, Xiaowei Chi, Xiaozhu Ju, Zhiyuan Qin, Kuangzhi Ge, Kai Tang, Peidong Jia, Shanghang Zhang, Jian Tang

TL;DR
This paper introduces TC-IDM, a tool-centric inverse dynamics model that bridges visual planning and physical control in robot motion, enhancing generalization and success rates in zero-shot and deformable object tasks.
Contribution
The paper presents a novel tool-centric inverse dynamics approach that improves the translation of visual plans into executable robot motions, supporting diverse end-effectors and out-of-distribution tasks.
Findings
Achieves a 61.11% success rate in real-world tasks.
Outperforms end-to-end VLA and inverse dynamics baselines.
Shows strong generalization to deformable objects and long-horizon tasks.
Abstract
The vision-language-action (VLA) paradigm has enabled powerful robotic control by leveraging vision-language models, but its reliance on large-scale, high-quality robot data limits its generalization. Generative world models offer a promising alternative for general-purpose embodied AI, yet a critical gap remains between their pixel-level plans and physically executable actions. To this end, we propose the Tool-Centric Inverse Dynamics Model (TC-IDM). By focusing on the tool's imagined trajectory as synthesized by the world model, TC-IDM establishes a robust intermediate representation that bridges the gap between visual planning and physical control. TC-IDM extracts the tool's point cloud trajectories via segmentation and 3D motion estimation from generated videos. Considering diverse tool attributes, our architecture employs decoupled action heads to project these planned…
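The abstract describes extracting the tool's point cloud trajectory from generated video and turning it into motion. The sketch below illustrates one common way such a trajectory could be converted into per-frame tool poses: the standard Kabsch (SVD-based) rigid alignment. This is an assumption made for illustration, not the method the paper states. It assumes a rigid tool and known point correspondences across frames, and the names `estimate_rigid_transform` and `tool_pose_trajectory` are hypothetical.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares rotation R and translation t with dst ~= src @ R.T + t.

    Standard Kabsch algorithm; assumes src[i] corresponds to dst[i].
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


def tool_pose_trajectory(frames):
    """Pose of the tool in every frame, relative to the first frame.

    frames: array of shape (T, N, 3), one tool point cloud per video frame
    (hypothetical input layout, assumed for this sketch).
    Returns a list of (R, t) tuples of length T.
    """
    reference = frames[0]
    return [estimate_rigid_transform(reference, frame) for frame in frames]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tool = rng.normal(size=(50, 3))
    # Synthetic motion: rotate 30 degrees about z, then translate.
    a = np.pi / 6
    R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                       [np.sin(a), np.cos(a), 0.0],
                       [0.0, 0.0, 1.0]])
    t_true = np.array([0.1, -0.2, 0.3])
    frames = np.stack([tool, tool @ R_true.T + t_true])
    R_est, t_est = tool_pose_trajectory(frames)[1]
    print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

A deformable or articulated tool would violate the rigidity assumption, so a pipeline like the one the abstract describes would need a different motion model in those cases.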
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
