Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement
Nikolaos Gkanatsios, Ayush Jain, Zhou Xian, Yunchu Zhang, Christopher Atkeson, Katerina Fragkiadaki

TL;DR
This paper introduces an energy-based, zero-shot planning framework for scene rearrangement that interprets compositional language instructions to guide robotic manipulation, outperforming language-to-action reactive policies and LLM planners.
Contribution
It represents each spatial concept in a language instruction as an energy function over relative object arrangements and plans goal scenes by minimizing the sum of these energies, generalizing to longer instructions and to concept compositions unseen in training.
Findings
Outperforms language-to-action reactive policies and LLM planners
Successfully executes highly compositional instructions zero-shot in simulation and real world
Achieves significant improvements on instruction-guided manipulation benchmarks
Abstract
Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then re-locate objects to the inferred goal locations. We test our model on established instruction-guided…
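The goal-generation step described in the abstract (one energy function per language predicate, summed and minimized by gradient descent) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the hand-written energies `left_of` and `aligned`, the `MARGIN` value, the object names, and the pre-parsed predicate list are hypothetical stand-ins, not the paper's energy functions, parser, or grounding model.

```python
# Toy sketch: compose one energy per predicate and descend on their sum.
# All energies below are hand-written assumptions, not the paper's models.

MARGIN = 0.1  # hypothetical minimum horizontal gap for "left of"


def left_of(pos, a, b):
    """Energy and gradient for 'a is at least MARGIN to the left of b'."""
    h = max(0.0, pos[a][0] - pos[b][0] + MARGIN)  # hinge: zero once satisfied
    return h * h, {a: (2 * h, 0.0), b: (-2 * h, 0.0)}


def aligned(pos, a, b):
    """Energy and gradient for 'a and b share the same y coordinate'."""
    d = pos[a][1] - pos[b][1]
    return d * d, {a: (0.0, 2 * d), b: (0.0, -2 * d)}


ENERGIES = {"left_of": left_of, "aligned": aligned}


def total_energy(pos, predicates):
    """Sum of the energies of all predicates in the instruction."""
    return sum(ENERGIES[name](pos, a, b)[0] for name, a, b in predicates)


def plan_goal(pos, predicates, lr=0.1, steps=500):
    """Gradient descent on the summed energy over 2D object positions."""
    pos = {k: tuple(v) for k, v in pos.items()}
    for _ in range(steps):
        grad = {k: [0.0, 0.0] for k in pos}
        for name, a, b in predicates:
            _, g = ENERGIES[name](pos, a, b)
            for obj, (gx, gy) in g.items():
                grad[obj][0] += gx
                grad[obj][1] += gy
        pos = {k: (pos[k][0] - lr * grad[k][0], pos[k][1] - lr * grad[k][1])
               for k in pos}
    return pos


# Hypothetical parse of an instruction such as "put the mug left of the
# plate, the fork left of the mug, and line the fork up with the plate".
predicates = [("left_of", "mug", "plate"),
              ("left_of", "fork", "mug"),
              ("aligned", "fork", "plate")]
start = {"fork": (0.5, 0.2), "mug": (0.3, 0.0), "plate": (0.0, 0.4)}
goal = plan_goal(start, predicates)
```

Because the energies are simply added, a longer instruction only appends entries to `predicates`; the descent loop itself does not change. In this sketch `goal` holds target positions only, and moving the objects there would be left to a separate low-level policy, as the abstract describes.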
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Test
