Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang

TL;DR
This paper combines Reinforced Cross-Modal Matching (RCM) with Self-Supervised Imitation Learning (SIL) to address the cross-modal grounding, ill-posed feedback, and generalization challenges in vision-language navigation, achieving state-of-the-art results.
Contribution
The paper proposes a novel RL-based cross-modal matching framework and a self-supervised imitation learning method to enhance navigation performance and generalization in unseen environments.
Findings
RCM outperforms previous methods by 10% on SPL.
SIL reduces the success rate gap between seen and unseen environments from 30.7% to 11.7%.
RCM achieves state-of-the-art performance on the VLN benchmark.
Abstract
Vision-language navigation (VLN) is the task of navigating an embodied agent to carry out natural language instructions inside real 3D environments. In this paper, we study how to address three critical challenges for this task: the cross-modal grounding, the ill-posed feedback, and the generalization problems. First, we propose a novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL). Particularly, a matching critic is used to provide an intrinsic reward to encourage global matching between instructions and trajectories, and a reasoning navigator is employed to perform cross-modal grounding in the local visual scene. Evaluation on a VLN benchmark dataset shows that our RCM model significantly outperforms previous methods by 10% on SPL and achieves the new state-of-the-art performance. To improve…
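The SPL figure quoted above refers to Success weighted by Path Length, a standard navigation metric: each episode scores zero on failure, and on success scores the ratio of the shortest-path length to the longer of the agent's path and the shortest path. The snippet below is a minimal sketch of that computation, not code from the paper; the function name `spl` and its argument names are illustrative assumptions.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    successes[i]        -- whether episode i reached the goal
    shortest_lengths[i] -- shortest-path length from start to goal (must be > 0)
    path_lengths[i]     -- length of the path the agent actually took
    """
    n = len(successes)
    if n == 0:
        raise ValueError("need at least one episode")
    if not (n == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if success:
            # A successful episode earns at most 1.0; detours shrink the score.
            total += shortest / max(taken, shortest)
    return total / n


# Three episodes: an optimal success, a success with a 2x detour, a failure.
print(spl([True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0]))  # 0.5
```

Because failed episodes contribute zero and detours are penalized, SPL rewards agents that both reach the goal and do so efficiently, which is why it is reported alongside plain success rate.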
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
