Iterative Vision-and-Language Navigation
Jacob Krantz, Shurjo Banerjee, Wang Zhu, Jason Corso, Peter Anderson,, Stefan Lee, Jesse Thomason

TL;DR
This paper introduces IVLN, a new paradigm for vision-and-language navigation that emphasizes persistent memory across multiple episodes, better reflecting real-world robot deployment scenarios.
Contribution
It proposes the IVLN paradigm and benchmarks, highlighting the importance of map-building agents for environment persistence in VLN tasks.
Findings
Map-building agents benefit from environment persistence
Extending implicit memory alone is insufficient for IVLN
New benchmarks with 400 tours in 80 scenes introduced
Abstract
We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for evaluating language-guided agents navigating in a persistent environment over time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the agent's memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information. However, deployed robots occupy the same environment for long periods of time. The IVLN paradigm addresses this disparity by training and evaluating VLN agents that maintain memory across tours of scenes that consist of up to 100 ordered instruction-following Room-to-Room (R2R) episodes, each defined by an individual language instruction and a target path. We present discrete and continuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours each in 80 indoor scenes. We find that extending the implicit memory of…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Speech and dialogue systems
