GHIL-Glue: Hierarchical Control with Filtered Subgoal Images
Kyle B. Hatch, Ashwin Balakrishna, Oier Mees, Suraj Nair, Seohong Park, Blake Wulfe, Masha Itkina, Benjamin Eysenbach, Sergey Levine, Thomas Kollar, and Benjamin Burchfiel

TL;DR
GHIL-Glue enhances hierarchical robot control by filtering and refining generative subgoals, significantly improving robustness and generalization in both simulated and real environments, and setting new benchmarks in language-conditioned manipulation tasks.
Contribution
Introduces GHIL-Glue, a novel interface that filters and improves generative subgoals, enhancing the integration of image/video prediction models with low-level policies.
Findings
25% performance improvement on CALVIN benchmark
Outperforms existing policies in zero-shot manipulation tasks
Achieves state-of-the-art results with RGB camera observations
Abstract
Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photorealistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively "glue together" language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation…
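The pipeline the abstract describes (a generative model proposes subgoal images, an interface filters them, and a goal-conditioned policy acts on the survivor) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `select_subgoal`, `hierarchical_step`, `propose_subgoals`, `score_fn`, and `low_level_policy`, along with the idea of ranking candidates by a scalar score, are all assumptions made for illustration only.

```python
from typing import Any, Callable, Sequence


def select_subgoal(candidates: Sequence[Any], score_fn: Callable[[Any], float]) -> Any:
    """Return the highest-scoring candidate subgoal (hypothetical filter step)."""
    if not candidates:
        raise ValueError("need at least one candidate subgoal")
    return max(candidates, key=score_fn)


def hierarchical_step(
    observation: Any,
    instruction: str,
    propose_subgoals: Callable[[Any, str, int], Sequence[Any]],
    score_fn: Callable[[Any, Any, str], float],
    low_level_policy: Callable[[Any, Any], Any],
    num_candidates: int = 4,
) -> Any:
    """One control step: propose subgoals, filter them, act toward the best one.

    All three callables are placeholders supplied by the caller:
    - propose_subgoals stands in for a language-conditioned image/video model,
    - score_fn stands in for whatever criterion rejects infeasible subgoals,
    - low_level_policy stands in for a goal-conditioned controller.
    """
    candidates = propose_subgoals(observation, instruction, num_candidates)
    subgoal = select_subgoal(
        candidates, lambda goal: score_fn(observation, goal, instruction)
    )
    return low_level_policy(observation, subgoal)
```

With toy numeric stand-ins for images, a call such as `hierarchical_step(obs, "open the drawer", propose, score, policy)` samples several candidates and passes only the best-scoring one to the policy, so a single implausible sample does not reach the low-level controller.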
Taxonomy
Topics: Advanced Vision and Imaging
