Vision-Language Navigation with Random Environmental Mixup
Chong Liu, Fengda Zhu, Xiaojun Chang, Xiaodan Liang, Zongyuan Ge, Yi-Dong Shen

TL;DR
This paper introduces Random Environmental Mixup (REM), a novel data augmentation technique for Vision-Language Navigation that creates cross-connected house scenes to improve agent generalization to unseen environments.
Contribution
The paper proposes REM, a new data augmentation method that explicitly reduces data bias across different house scenes in VLN tasks, enhancing generalization to unseen environments.
Findings
REM improves navigation performance in unseen scenes.
The approach reduces the performance gap between seen and unseen environments.
The resulting model achieves state-of-the-art results on VLN benchmarks.
Abstract
Vision-Language Navigation (VLN) tasks require an agent to navigate step by step while perceiving visual observations and comprehending a natural language instruction. Large data bias, caused by the disparity between the small data scale and the large navigation space, makes the VLN task challenging. Previous works have proposed various data augmentation methods to reduce data bias. However, these works do not explicitly reduce the data bias across different house scenes. As a result, the agent overfits to the seen scenes and navigates poorly in unseen scenes. To tackle this problem, we propose the Random Environmental Mixup (REM) method, which generates cross-connected house scenes as augmented data by mixing up environments. Specifically, we first select key viewpoints according to the room connection graph for each scene. Then, we cross-connect…
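The two steps named in the abstract (pick key viewpoints, then cross-connect scenes) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each scene is an undirected viewpoint graph stored as an adjacency dict, that the key viewpoints are already given as one "key edge" per scene whose removal splits the scene in two, and that mixing means swapping halves across that edge. All function names and viewpoint ids are hypothetical, and only the graph step is covered.

```python
from collections import deque


def reachable(scene, start, cut_edge):
    """Viewpoints reachable from `start` without crossing `cut_edge`."""
    cut = frozenset(cut_edge)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in scene[node]:
            if nxt in seen or frozenset((node, nxt)) == cut:
                continue
            seen.add(nxt)
            queue.append(nxt)
    return seen


def induced(scene, nodes, tag):
    """Copy the subgraph on `nodes`, prefixing ids with `tag` to avoid clashes."""
    return {
        f"{tag}:{n}": {f"{tag}:{m}" for m in scene[n] if m in nodes}
        for n in nodes
    }


def join(scene_x, end_x, cut_x, tag_x, scene_y, end_y, cut_y, tag_y):
    """Glue the half of scene_x around end_x to the half of scene_y around end_y."""
    half_x = reachable(scene_x, end_x, cut_x)
    half_y = reachable(scene_y, end_y, cut_y)
    mixed = {**induced(scene_x, half_x, tag_x), **induced(scene_y, half_y, tag_y)}
    # The new cross-scene connection (assumed: a single undirected edge).
    mixed[f"{tag_x}:{end_x}"].add(f"{tag_y}:{end_y}")
    mixed[f"{tag_y}:{end_y}"].add(f"{tag_x}:{end_x}")
    return mixed


def mix_scenes(scene_a, key_edge_a, scene_b, key_edge_b):
    """Return two cross-connected scenes built from the halves of A and B."""
    (a1, a2), (b1, b2) = key_edge_a, key_edge_b
    for scene, (u, v) in ((scene_a, key_edge_a), (scene_b, key_edge_b)):
        if v in reachable(scene, u, (u, v)):
            raise ValueError("key edge must split the scene in two")
    mixed_1 = join(scene_a, a1, key_edge_a, "A", scene_b, b2, key_edge_b, "B")
    mixed_2 = join(scene_b, b1, key_edge_b, "B", scene_a, a2, key_edge_a, "A")
    return mixed_1, mixed_2


# Toy scenes: two corridors of viewpoints (hypothetical ids).
scene_a = {"a0": {"a1"}, "a1": {"a0", "a2"}, "a2": {"a1", "a3"}, "a3": {"a2"}}
scene_b = {"b0": {"b1"}, "b1": {"b0", "b2"}, "b2": {"b1"}}
mixed_1, mixed_2 = mix_scenes(scene_a, ("a1", "a2"), scene_b, ("b0", "b1"))
```

In this toy case the first mixed scene joins the `a0`-`a1` half of scene A to the `b1`-`b2` half of scene B, and the second joins the remaining halves, so every original viewpoint appears in exactly one mixed scene.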
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Convolution · Dense Connections · Q-Learning · Deep Q-Network · Random Ensemble Mixture · Mixup
