Compositional Foundation Models for Hierarchical Planning
Anurag Ajay, Seungwook Han, Yilun Du, Shuang Li, Abhi Gupta, Tommi Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, Pulkit Agrawal

TL;DR
This paper introduces HiP, a hierarchical planning framework that combines language, vision, and action models to improve decision-making in complex, long-horizon tasks through symbolic planning, visual reasoning, and visual-motor control.
Contribution
The paper presents a novel compositional foundation model that integrates multiple expert models for hierarchical planning in long-horizon tasks, enabling effective reasoning and execution.
Findings
Successful application to three long-horizon table-top manipulation tasks
Effective grounding of symbolic plans in visual and motor control
Enhanced hierarchical reasoning through iterative model refinement
Abstract
To make effective decisions in novel environments with long-horizon goals, it is crucial to engage in hierarchical reasoning across spatial and temporal scales. This entails planning abstract subgoal sequences, visually reasoning about the underlying plans, and executing actions in accordance with the devised plan through visual-motor control. We propose Compositional Foundation Models for Hierarchical Planning (HiP), a foundation model that jointly leverages multiple expert foundation models, each trained individually on language, vision, or action data, to solve long-horizon tasks. We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model. Generated video plans are then grounded to visual-motor control through an inverse dynamics model that infers actions from generated videos. To enable effective reasoning…
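The three-level flow described in the abstract (language model to subgoals, video model to frames, inverse dynamics to actions) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `HierarchicalPlanner`, `propose_subgoals`, `imagine_frames`, and `infer_action` are placeholder names, and the three toy functions are trivial stand-ins for the real learned models. The sketch only shows how the stages could be chained; it does not model any feedback between the levels.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]   # placeholder for an image observation (assumption)
Action = List[float]  # placeholder for a motor command (assumption)


@dataclass
class HierarchicalPlanner:
    """Chains three hypothetical model callables into one planning pass."""
    propose_subgoals: Callable[[str], List[str]]         # stands in for the language model
    imagine_frames: Callable[[str, Frame], List[Frame]]  # stands in for the video diffusion model
    infer_action: Callable[[Frame, Frame], Action]       # stands in for the inverse dynamics model

    def plan(self, goal: str, observation: Frame) -> List[Action]:
        actions: List[Action] = []
        current = observation
        for subgoal in self.propose_subgoals(goal):
            # Ground the symbolic subgoal as a sequence of future frames.
            trajectory = [current] + self.imagine_frames(subgoal, current)
            # Recover one action for each pair of consecutive frames.
            for before, after in zip(trajectory, trajectory[1:]):
                actions.append(self.infer_action(before, after))
            # The next subgoal starts from the last imagined frame.
            current = trajectory[-1]
        return actions


# Toy stand-ins so the sketch runs end to end; real models would replace these.
def toy_subgoals(goal: str) -> List[str]:
    return [part.strip() for part in goal.split(" then ")]


def toy_frames(subgoal: str, frame: Frame) -> List[Frame]:
    return [[x + 1.0 for x in frame], [x + 2.0 for x in frame]]


def toy_action(before: Frame, after: Frame) -> Action:
    return [b - a for a, b in zip(before, after)]


planner = HierarchicalPlanner(toy_subgoals, toy_frames, toy_action)
actions = planner.plan("pick block then place block", [0.0])
```

With these toy stand-ins, the goal splits into two subgoals, each subgoal yields two imagined frames, and one action is inferred per frame transition. Passing the models in as plain callables keeps each level replaceable on its own, which matches the compositional framing of the abstract but is a design choice of this sketch rather than a documented detail of HiP.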
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Games · Human Pose and Action Recognition
Methods: Diffusion
