Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation
Jiaqi Chen, Bingqian Lin, Xinmin Liu, Lin Ma, Xiaodan Liang, Kwan-Yee K. Wong

TL;DR
This paper introduces AO-Planner, a novel zero-shot affordances-oriented planning method for continuous vision-language navigation, integrating foundation models for low-level control and high-level decision-making, achieving state-of-the-art results.
Contribution
The paper presents AO-Planner, combining foundation models for low-level motion planning and high-level reasoning in continuous VLN, bridging the gap between high-level task planning and low-level control.
Findings
Achieves an 8.8% improvement in SPL on the R2R-CE and RxR-CE datasets.
Can serve as a data annotator for pseudo-label generation.
Attains a 47% success rate with a data-efficient predictor.
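For readers unfamiliar with the SPL metric cited above, the following is a minimal sketch of the standard Success weighted by Path Length definition used in embodied navigation benchmarks. The function name `spl` and its argument names are illustrative, not taken from the paper, and the numbers in the usage line are made up to show the arithmetic.

```python
from typing import Sequence


def spl(successes: Sequence[bool],
        shortest_lengths: Sequence[float],
        path_lengths: Sequence[float]) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Each episode contributes S * l / max(p, l), where S is 1 on success
    and 0 otherwise, l is the shortest-path length from start to goal,
    and p is the length of the path the agent actually took.
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if not (n == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path length must be positive")
        if success:
            # A successful episode is penalised only when the agent's
            # path is longer than the shortest path.
            total += shortest / max(taken, shortest)
    return total / n


# Illustrative values only: one optimal success, one failure,
# one success along a path twice the shortest length.
score = spl([True, False, True], [10.0, 10.0, 10.0], [10.0, 20.0, 20.0])
```

Because failed episodes contribute zero, SPL can never exceed the success rate on the same set of episodes.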
Abstract
LLM-based agents have demonstrated impressive zero-shot performance in vision-language navigation (VLN) tasks. However, existing LLM-based methods often focus only on solving high-level task planning by selecting nodes in predefined navigation graphs for movements, overlooking low-level control in navigation scenarios. To bridge this gap, we propose AO-Planner, a novel Affordances-Oriented Planner for continuous VLN tasks. Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making, both performed in a zero-shot setting. Specifically, we employ a Visual Affordances Prompting (VAP) approach, where the visible ground is segmented by SAM to provide navigational affordances, based on which the LLM selects potential candidate waypoints and plans low-level paths towards selected waypoints. We further propose a…
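The VAP idea described in the abstract (ground segmentation providing the points an LLM may choose from) can be illustrated with a rough sketch. The code below is not the authors' implementation: the grid-sampling strategy, the prompt wording, and the names `sample_ground_points` and `format_point_prompt` are all assumptions made for illustration. The ground mask is a plain nested list standing in for a segmentation output such as one produced by SAM.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (row, col) pixel coordinate


def sample_ground_points(mask: Sequence[Sequence[bool]], stride: int) -> List[Point]:
    """Return grid points, spaced `stride` pixels apart, that lie on the ground.

    `mask[r][c]` is True where the pixel was segmented as visible ground.
    Uniform grid sampling is an assumption of this sketch.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    points: List[Point] = []
    for r in range(stride // 2, len(mask), stride):
        row = mask[r]
        for c in range(stride // 2, len(row), stride):
            if row[c]:
                points.append((r, c))
    return points


def format_point_prompt(points: Sequence[Point], instruction: str) -> str:
    """Build a text prompt listing the numbered ground points (hypothetical wording)."""
    lines = [
        f"Instruction: {instruction}",
        "Candidate ground points (id: row, col):",
    ]
    for idx, (r, c) in enumerate(points):
        lines.append(f"{idx}: ({r}, {c})")
    lines.append("Select waypoint ids and, for each, an ordered list of point ids forming a path.")
    return "\n".join(lines)


# Toy 4x4 mask: only the bottom two rows are ground.
toy_mask = [
    [False, False, False, False],
    [False, False, False, False],
    [True, True, True, True],
    [True, True, True, True],
]
candidates = sample_ground_points(toy_mask, stride=2)
prompt = format_point_prompt(candidates, "Walk forward to the door.")
```

Restricting the candidates to ground pixels is what makes the selection affordance-oriented in this sketch: the model can only pick points the agent could plausibly stand on.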
Taxonomy
Topics: Robotic Path Planning Algorithms · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications
Methods: Focus · Segment Anything Model
