Learning Universal Policies via Text-Guided Video Generation
Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, Pieter Abbeel

TL;DR
This paper proposes a novel approach to general-purpose decision making by generating future video frames conditioned on text goals, enabling flexible and transferable policies across tasks and environments.
Contribution
It introduces a text-guided video generation framework for decision making, allowing for natural goal specification and cross-domain generalization in AI agents.
Findings
Effective in generating goal-directed videos from text descriptions
Enables transfer learning using pretrained language models and internet videos
Demonstrates generalization across diverse robot manipulation tasks
Abstract
A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks. Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images, exhibiting combinatorial generalization across domains. Motivated by this success, we investigate whether such tools can be used to construct more general-purpose agents. Specifically, we cast the sequential decision making problem as a text-conditioned video generation problem, where, given a text-encoded specification of a desired goal, a planner synthesizes a set of future frames depicting its planned actions in the future, after which control actions are extracted from the generated video. By leveraging text as the underlying goal specification, we are able to naturally and combinatorially generalize to novel goals. The proposed policy-as-video formulation can…
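The pipeline described in the abstract (a text goal goes in, a planner produces future frames, and actions are read off those frames) can be illustrated with a toy sketch. This is not the paper's model. The grid world, the `GOALS` lookup table, and the functions `render`, `locate`, `plan_video`, and `extract_actions` are all hypothetical stand-ins chosen for illustration: a hand-written rule replaces the learned video generator, and a frame-difference rule replaces whatever learned component recovers control actions.

```python
import numpy as np

GRID = 5  # side length of the toy square image (assumption for this sketch)

# Hypothetical text goals mapped to target cells.
# This lookup stands in for a learned text encoder.
GOALS = {
    "move block to top left": (0, 0),
    "move block to bottom right": (GRID - 1, GRID - 1),
}


def render(pos):
    """Draw a single 'block' as a 1.0 pixel on an otherwise empty frame."""
    frame = np.zeros((GRID, GRID))
    frame[pos] = 1.0
    return frame


def locate(frame):
    """Recover the block position (row, col) from a frame."""
    r, c = np.unravel_index(np.argmax(frame), frame.shape)
    return int(r), int(c)


def plan_video(first_frame, goal_text):
    """Toy planner: return frames leading from the observation to the goal.

    A real system would sample these frames from a text-conditioned
    video generator; here the block moves one cell per frame.
    """
    pos = locate(first_frame)
    target = GOALS[goal_text]
    frames = [first_frame]
    while pos != target:
        step_r = int(np.sign(target[0] - pos[0]))
        step_c = int(np.sign(target[1] - pos[1]))
        pos = (pos[0] + step_r, pos[1] + step_c)
        frames.append(render(pos))
    return frames


def extract_actions(frames):
    """Toy action extraction: the displacement between consecutive frames."""
    actions = []
    for prev, nxt in zip(frames, frames[1:]):
        (r0, c0), (r1, c1) = locate(prev), locate(nxt)
        actions.append((r1 - r0, c1 - c0))
    return actions


video = plan_video(render((3, 1)), "move block to top left")
actions = extract_actions(video)
print(len(video), actions)
```

The point of the sketch is the separation of concerns: the planner only reasons in image space, conditioned on text, and the controller only needs to explain the change between adjacent frames.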
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
