This&That: Language-Gesture Controlled Video Generation for Robot Planning

Boyang Wang; Nikhil Sridhar; Chao Feng; Mark Van der Merwe; Adam Fishman; Nima Fazeli; Jeong Joon Park

arXiv:2407.05530 · cs.RO · May 20, 2025

TL;DR

This paper introduces This&That, a framework that uses language-gesture conditioned video generation to improve robot planning and execution of complex tasks, outperforming prior state-of-the-art behavior cloning methods.

Contribution

The paper presents an approach that combines video generative models with language-gesture conditioning for clearer task communication and improved robot planning.

Findings

01 Outperforms prior state-of-the-art behavior cloning methods

02 Enables unambiguous task communication with simple instructions

03 Successfully translates visual plans into robot actions

Abstract

Clear, interpretable instructions are invaluable when attempting any complex task. Good instructions help to clarify the task and even anticipate the steps needed to solve it. In this work, we propose a robot learning framework for communicating, planning, and executing a wide range of tasks, dubbed This&That. This&That solves general tasks by leveraging video generative models, which, through training on internet-scale data, contain rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intent, and 3) translating visual plans into robot actions. This&That uses language-gesture conditioning to generate video predictions, as a succinct and unambiguous alternative to existing language-only methods, especially in…

Code & Models

Videos

This&That: Lerobot Tech Talk #7 by Jeong Joon Park · YouTube

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Human Pose and Action Recognition

Methods: Diffusion