General Scene Adaptation for Vision-and-Language Navigation
Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu

TL;DR
This paper introduces GSA-VLN, a new scene-adaptive VLN task with a diverse dataset and a memory-based navigation method, significantly improving agent performance in persistent environments.
Contribution
The paper proposes a new scene adaptation task (GSA-VLN), a new dataset (GSA-R2R), and a memory-based navigation method (GR-DUET), advancing zero-shot and continual learning in VLN.
Findings
GR-DUET achieves state-of-the-art results on GSA-R2R.
The dataset GSA-R2R enhances evaluation of adaptability in VLN.
Three-stage instruction refinement improves instruction understanding.
Abstract
Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments, aiming to develop agents capable of functioning in any environment in a zero-shot manner. However, real-world navigation robots often operate in persistent environments with relatively consistent physical layouts, visual observations, and language styles from instructors. Such a gap in the task setting presents an opportunity to improve VLN agents by incorporating continuous adaptation to specific environments. To better reflect these real-world conditions, we introduce GSA-VLN, a novel task requiring agents to execute navigation instructions within a specific scene and simultaneously adapt to it for improved performance over time. To evaluate the proposed task, one has to address two challenges in existing VLN datasets: the lack of OOD…
Peer Reviews
Decision: ICLR 2025 Poster
Novelty: The paper introduces a new task, GSA-VLN, which focuses on the long-term adaptation of agents within specific environments, a capability with significant potential for real-world applications. Dataset Contribution: The authors present the GSA-R2R dataset, which extends the existing R2R dataset by using GPT-4 and a three-stage method to generate instructions in various speaking styles. The dataset is divided into residential and non-residential environments, serving as in-distribution (
Please see the Questions section for detailed improvement suggestions and questions. I look forward to the authors' responses to these questions, as addressing these points could significantly clarify some of the paper's contributions and limitations. I am open to adjusting my score if the authors provide further insights or resolve the concerns raised above.
1. This paper introduces the novel General Scene Adaptation for Vision-and-Language Navigation (GSA-VLN) task, filling a critical gap in VLN research by focusing on adaptation in persistent environments. Rather than assuming agents will encounter only unseen environments, GSA-VLN models a more realistic scenario where agents learn and improve over time within a familiar setting. This shift in task formulation is both timely and innovative, especially as VLN moves toward practical applications. 2
1. The GR-DUET method involves a memory bank and a global graph that retains historical information across episodes. As the memory and graph size increase, the model’s computational requirements may grow significantly, particularly for long-term navigation in large environments. While the paper includes an environment-specific training strategy to limit graph expansion, providing an analysis of computational costs and potential trade-offs between memory retention and scalability would strengthen
1. The paper is detailed, clearly presented, and easy to follow. 2. The authors approach the VLN problem from a new perspective and divide the scenarios into Residential and Non-Residential.
1. The novelty of this paper is limited. The GSA-VLN task proposed in the paper is still a standard VLN task. The so-called "standard VLN task" mentioned by the paper also includes fine-tuning based on historical information and trained models, which are claimed as the novelty of GSA-VLN in Section 3.2. 2. Following the previous comment, the GSA-R2R dataset proposed in the paper uses more environments (HM3D), and then uses tools such as LLMs to refine the dataset's quality, which has been a com
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging
