RePLan: Robotic Replanning with Perception and Language Models
Marta Skreta, Zihan Zhou, Jia Lin Yuan, Kourosh Darvish, Alán Aspuru-Guzik, Animesh Garg

TL;DR
RePLan introduces a novel framework that leverages vision-language models for online replanning in robotics, enabling robots to adapt to unforeseen obstacles and achieve long-horizon goals effectively.
Contribution
The paper presents RePLan, a new approach integrating VLMs for real-time replanning in robotic tasks, bridging high-level reasoning with low-level control.
Findings
RePLan successfully adapts to unforeseen obstacles in long-horizon tasks.
The framework outperforms baseline models in dynamic environments.
RePLan is applicable to real-world robotic systems.
Abstract
Advancements in large language models (LLMs) have demonstrated their potential in facilitating high-level reasoning, logical reasoning and robotics planning. Recently, LLMs have also been able to generate reward functions for low-level robot actions, effectively bridging the interface between high-level planning and low-level robot control. However, the challenge remains that even with syntactically correct plans, robots can still fail to achieve their intended goals due to imperfect plans or unexpected environmental issues. To overcome this, Vision Language Models (VLMs) have shown remarkable success in tasks such as visual question answering. Leveraging the capabilities of VLMs, we present a novel framework called Robotic Replanning with Perception and Language Models (RePLan) that enables online replanning capabilities for long-horizon tasks. This framework utilizes the physical…
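The replanning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`run_with_replanning`, `plan_fn`, `execute_fn`, `perceive_fn`), the `max_replans` limit, and the toy drawer scenario are all hypothetical stand-ins for the high-level planner, the low-level controller, and the VLM-based perception step, which in the real system would call language and vision-language models.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReplanResult:
    success: bool
    executed: List[str] = field(default_factory=list)
    replans: int = 0


def run_with_replanning(
    goal: str,
    plan_fn: Callable[[str, str], List[str]],   # hypothetical high-level planner
    execute_fn: Callable[[str], bool],          # hypothetical low-level controller
    perceive_fn: Callable[[str], str],          # hypothetical perception feedback
    max_replans: int = 3,                       # assumed limit, not from the paper
) -> ReplanResult:
    """Plan, execute step by step, and replan with feedback when a step fails."""
    result = ReplanResult(success=False)
    feedback = ""
    for attempt in range(max_replans + 1):
        plan = plan_fn(goal, feedback)
        failed = False
        for step in plan:
            if execute_fn(step):
                result.executed.append(step)
            else:
                # Ask the perception module why the step failed.
                feedback = perceive_fn(step)
                failed = True
                break
        if not failed:
            result.success = True
            return result
        if attempt < max_replans:
            result.replans += 1
    return result


# Toy scenario (assumed for illustration): a drawer blocked by an object.
world = {"blocked": True}


def toy_plan(goal: str, feedback: str) -> List[str]:
    if "blocked" in feedback:
        return ["remove obstacle", goal]
    return [goal]


def toy_execute(step: str) -> bool:
    if step == "remove obstacle":
        world["blocked"] = False
        return True
    return not world["blocked"]


def toy_perceive(step: str) -> str:
    return "the drawer is blocked by an object"


result = run_with_replanning("open drawer", toy_plan, toy_execute, toy_perceive)
print(result)
```

In this toy run the first plan fails, the perception feedback is passed back to the planner, and the second plan removes the obstacle before retrying the goal.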
Peer Reviews
Decision·Submitted to ICLR 2024
1. The proposed method is interesting and could probably work for a wide variety of planning tasks. 2. The tasks are somewhat more complex than simple pick-and-place and illustrate more complex reasoning. I am increasing the score to 6 after rebuttal discussions.
1. The authors need to provide more environments or tasks to show the robustness of their method. They could use LLMs to generate a wider variety of long-horizon tasks so that the proposed method can be tested more thoroughly. I think this remains to be verified.
- The paper is well-written and easy to follow - The motivation of the paper is topical, since feedback and adaptive replanning are important for foundation models, which may hallucinate or require grounding in physical interactions - The method does not require additional human input compared to the baseline method Language2Reward (just one human input at the beginning; the rest of the replanning and execution is completed autonomously by the foundation model submodules)
- The VLM Perceiver is one of the most critical parts of the method, but it is not sufficiently explained. Due to the lack of details, I can only assume how it is utilized based on Algorithm 1, in which case I have some major concerns, since the VLM is the bottleneck for providing feedback for grounding LLM plans and rewards for future LLM iterations. However, details are not shared about how the Perceiver is used, even though it is mentioned that "The High-Level Planner [is used] to decide what
- The presented idea is clear and well-motivated: leveraging LLMs in a hierarchical framework for both high-level task planning and low-level motion planning. Compared to prior work, “Language to Reward”, it is clear that such a hierarchical approach is needed for long-horizon tasks and can also offer additional robustness, as the system can replan its high-level actions. - The literature review is also thorough, covering many recent works in this domain. However, this part can be improved because
- Currently the biggest limitation seems to be the lack of thorough experiments, which could be improved along two axes. One is the breadth of the tasks: only four tasks are investigated in this work, and there are considerable similarities between them. An important advantage of using LLMs is that they can be applied to a wider set of tasks more easily. The other axis is the quantitative evaluation: currently only 3 runs are performed for each entry in Table 1, which makes
Taxonomy
Topics: Robot Manipulation and Learning
