Bridging the visual gap in VLN via semantically richer instructions
Joaquin Ossandón, Benjamin Earle, Álvaro Soto

TL;DR
This paper identifies that current VLN models underutilize visual information and proposes a data augmentation method using richer, object-based instructions to improve navigation success rates in unseen environments.
Contribution
The paper introduces a novel data augmentation technique that incorporates explicit visual object information into instructions, bridging the semantic gap in VLN datasets.
Findings
8% increase in success rate on unseen environments
State-of-the-art models lose little performance with limited or no visual input, indicating overfitting to the textual instructions
Enhanced instructions lead to better visual understanding in VLN models
Abstract
The Visual-and-Language Navigation (VLN) task requires understanding a textual instruction to navigate a natural indoor environment using only visual information. While this is a trivial task for most humans, it is still an open problem for AI models. In this work, we hypothesize that poor use of the visual information available is at the core of the low performance of current models. To support this hypothesis, we provide experimental evidence showing that state-of-the-art models are not severely affected when they receive just limited or even no visual data, indicating a strong overfitting to the textual instructions. To encourage a more suitable use of the visual information, we propose a new data augmentation method that fosters the inclusion of more explicit visual information in the generation of textual navigational instructions. Our main intuition is that current VLN datasets…
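The visual-ablation argument in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation protocol: the episode format, the `ablate_visual` and `success_rate` helpers, and both toy policies are hypothetical and exist only to show the reasoning. If a policy's success rate barely changes when its visual features are zeroed out, it is effectively relying on the instruction text alone.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical episode format: (instruction, visual feature vector, correct action).
Episode = Tuple[str, List[float], str]
Policy = Callable[[str, List[float]], str]


def ablate_visual(features: Sequence[float], keep_fraction: float) -> List[float]:
    """Keep the first `keep_fraction` of the features and zero out the rest."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    n_keep = int(round(len(features) * keep_fraction))
    return [f if i < n_keep else 0.0 for i, f in enumerate(features)]


def success_rate(policy: Policy, episodes: Sequence[Episode], keep_fraction: float) -> float:
    """Fraction of episodes where the policy picks the correct action under ablation."""
    hits = 0
    for instruction, features, target in episodes:
        action = policy(instruction, ablate_visual(features, keep_fraction))
        hits += action == target
    return hits / len(episodes)


def text_only_policy(instruction: str, features: List[float]) -> str:
    """Toy policy that ignores vision entirely (assumption, for illustration)."""
    return "left" if "left" in instruction.split() else "right"


def visual_policy(instruction: str, features: List[float]) -> str:
    """Toy policy that decides from the visual features alone (assumption)."""
    return "left" if features[0] > features[1] else "right"


# Made-up toy data, not drawn from any VLN dataset.
EPISODES: List[Episode] = [
    ("turn left at the sofa", [0.9, 0.1], "left"),
    ("turn right at the lamp", [0.1, 0.9], "right"),
    ("go left past the table", [0.8, 0.2], "left"),
    ("head right to the door", [0.2, 0.7], "right"),
]

if __name__ == "__main__":
    for name, policy in [("text-only", text_only_policy), ("visual", visual_policy)]:
        full = success_rate(policy, EPISODES, keep_fraction=1.0)
        blind = success_rate(policy, EPISODES, keep_fraction=0.0)
        print(f"{name}: full vision={full:.2f}, no vision={blind:.2f}")
```

In this toy setting the text-only policy is unaffected by removing vision, while the visual policy degrades. The first pattern is the one the abstract reports for state-of-the-art models.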
