Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation
Muhammad Zubair Irshad, Chih-Yao Ma, Zsolt Kira

TL;DR
This paper introduces Robo-VLN, a more realistic continuous environment for vision-and-language navigation, and proposes a hierarchical agent that significantly improves navigation performance over existing methods.
Contribution
It presents Robo-VLN, a new continuous 3D environment for VLN, and develops a hierarchical agent with layered decision-making that outperforms prior approaches.
Findings
Hierarchical agent outperforms baselines in Robo-VLN
Layered decision-making improves navigation accuracy
Modular training enhances policy effectiveness
Abstract
Deep Learning has revolutionized our ability to solve complex problems such as Vision-and-Language Navigation (VLN). This task requires the agent to navigate to a goal purely based on visual sensory inputs given natural language instructions. However, prior works formulate the problem as a navigation graph with a discrete action space. In this work, we lift the agent off the navigation graph and propose a more complex VLN setting in continuous 3D reconstructed environments. Our proposed setting, Robo-VLN, more closely mimics the challenges of real world navigation. Robo-VLN tasks have longer trajectory lengths, continuous action spaces, and challenges such as obstacles. We provide a suite of baselines inspired by state-of-the-art works in discrete VLN and show that they are less effective at this task. We further propose that decomposing the task into specialized high- and low-level…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
