This&That: Language-Gesture Controlled Video Generation for Robot Planning
Boyang Wang, Nikhil Sridhar, Chao Feng, Mark Van der Merwe, Adam Fishman, Nima Fazeli, Jeong Joon Park

TL;DR
This paper introduces This&That, a framework that uses language- and gesture-conditioned video generation to plan and execute complex robot tasks, outperforming prior state-of-the-art behavior cloning methods.
Contribution
It presents a novel approach combining video generative models with language-gesture conditioning for clearer task communication and improved robot planning.
Findings
Outperforms prior state-of-the-art behavior cloning methods
Enables unambiguous task communication with simple instructions
Successfully translates visual plans into robot actions
Abstract
Clear, interpretable instructions are invaluable when attempting any complex task. Good instructions help to clarify the task and even anticipate the steps needed to solve it. In this work, we propose a robot learning framework for communicating, planning, and executing a wide range of tasks, dubbed This&That. This&That solves general tasks by leveraging video generative models, which, through training on internet-scale data, contain rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intent, and 3) translating visual plans into robot actions. This&That uses language-gesture conditioning to generate video predictions, as a succinct and unambiguous alternative to existing language-only methods, especially in…
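As a rough illustration of what a language-gesture instruction could look like as data, the sketch below pairs a short text prompt with 2D gesture points and rasterizes each point into a binary mask channel. The abstract does not say how gestures are encoded, so the pixel-coordinate representation, the mask rasterization, and every name here (LanguageGestureInstruction, gesture_masks, radius) are assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a gesture is a (row, col) pixel location on the first video frame.
Point = Tuple[int, int]


@dataclass
class LanguageGestureInstruction:
    """Hypothetical container: a simple text prompt plus pointing gestures."""
    text: str
    gestures: List[Point]


def gesture_masks(instr: LanguageGestureInstruction,
                  height: int, width: int, radius: int = 1) -> List[List[List[int]]]:
    """Rasterize each gesture into its own binary height x width channel.

    A square patch of side 2*radius+1 (clipped at the image border) is set
    to 1 around each gesture point; everything else stays 0.
    """
    channels = []
    for r, c in instr.gestures:
        if not (0 <= r < height and 0 <= c < width):
            raise ValueError(f"gesture {(r, c)} lies outside the frame")
        channel = [[0] * width for _ in range(height)]
        for rr in range(max(0, r - radius), min(height, r + radius + 1)):
            for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                channel[rr][cc] = 1
        channels.append(channel)
    return channels


# Example: "this" points at the object, "there" points at the target location.
instruction = LanguageGestureInstruction(
    text="put this there",
    gestures=[(2, 2), (0, 5)],
)
masks = gesture_masks(instruction, height=6, width=8, radius=1)
```

In a setup like this, the mask channels could be stacked with the first frame and passed to a video model alongside the text prompt; how This&That actually injects the conditioning is not stated in the abstract.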
Videos
This&That: Lerobot Tech Talk #7 by Jeong Joon Park · YouTube
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Human Pose and Action Recognition
Methods: Diffusion
