OctoNav: Towards Generalist Embodied Navigation
Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, Si Liu

TL;DR
This paper introduces OctoNav, a comprehensive benchmark and method for developing generalist embodied navigation agents capable of following complex, free-form multi-modal instructions, with a focus on reasoning and multi-capability integration.
Contribution
The paper presents a large-scale benchmark (OctoNav-Bench), a method based on multi-modal large language models (OctoNav-R1), and a hybrid training paradigm to advance generalist embodied navigation.
Findings
OctoNav-R1 outperforms previous methods in navigation tasks.
The hybrid training paradigm improves reasoning and action planning.
The benchmark enables diverse instruction-trajectory pairings for comprehensive evaluation.
Abstract
Embodied navigation stands as a foundational pillar within the broader pursuit of embodied AI. However, previous navigation research is divided into different tasks/capabilities, e.g., ObjNav, ImgNav and VLN, which differ in task objectives and modalities, so datasets and methods are designed individually for each. In this work, we take steps toward generalist navigation agents, which can follow free-form instructions that include arbitrary compounds of modalities and capabilities. To achieve this, we propose a large-scale benchmark and a corresponding method, termed OctoNav-Bench and OctoNav-R1, respectively. Specifically, OctoNav-Bench features continuous environments and is constructed via a designed annotation pipeline. We thoroughly craft instruction-trajectory pairs, where instructions are diverse and free-form, with arbitrary modalities and capabilities. Also, we construct a Think-Before-Action…
Taxonomy
Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Social Robot Interaction and HRI
