TL;DR
This paper introduces Structured Scene Memory (SSM), a novel architecture for vision-language navigation that enhances long-term planning and environment understanding, leading to state-of-the-art results on VLN benchmarks.
Contribution
The paper proposes SSM, a structured, compartmentalized memory system that captures environment layouts and supports global planning in VLN tasks.
Findings
Achieves state-of-the-art performance on the R2R and R4R datasets.
Captures environment layouts and disentangles visual and geometric cues.
Supports efficient global navigation planning through a frontier-exploration strategy.
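To make the frontier-exploration idea concrete, here is a minimal Python sketch of a graph-style scene memory. It is an illustration under stated assumptions, not the paper's implementation: the class `SceneMemory`, its methods, the helper `choose_frontier`, and the `score` callback (standing in for a learned policy) are all hypothetical names. The sketch only shows how a memory of visited viewpoints and observed-but-unvisited frontiers can serve as a global action space, with visual and geometric features stored in separate fields.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SceneMemory:
    """Toy graph memory (hypothetical): visited viewpoints plus unvisited frontiers."""
    edges: dict = field(default_factory=dict)      # node -> set of neighbouring nodes
    visited: set = field(default_factory=set)      # nodes the agent has stood on
    visual: dict = field(default_factory=dict)     # node -> visual feature
    geometric: dict = field(default_factory=dict)  # node -> geometric feature

    def add_observation(self, node, neighbours, visual_feat, geometric_feat):
        """Record a visit to `node` and the navigable neighbours seen from it."""
        self.visited.add(node)
        self.visual[node] = visual_feat
        self.geometric[node] = geometric_feat
        self.edges.setdefault(node, set())
        for n in neighbours:
            self.edges[node].add(n)
            self.edges.setdefault(n, set()).add(node)

    def frontiers(self):
        """Nodes observed from a visited node but not yet visited themselves."""
        return set(self.edges) - self.visited

    def path_to(self, start, goal):
        """Breadth-first shortest path that only passes through visited nodes."""
        parent = {start: None}
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            if cur == goal:
                path = []
                while cur is not None:
                    path.append(cur)
                    cur = parent[cur]
                return path[::-1]
            for nxt in sorted(self.edges[cur]):
                # Unvisited nodes may end a path but may not be passed through.
                if nxt in parent or (nxt not in self.visited and nxt != goal):
                    continue
                parent[nxt] = cur
                queue.append(nxt)
        return None


def choose_frontier(memory, score):
    """Pick the highest-scoring frontier; `score` is an assumed stand-in policy."""
    return max(sorted(memory.frontiers()), key=score)


# Usage: visit "a" then "b", then pick a frontier anywhere in the memory.
memory = SceneMemory()
memory.add_observation("a", ["b", "c"], visual_feat=[0.1], geometric_feat=[0.0, 0.0])
memory.add_observation("b", ["a", "d"], visual_feat=[0.7], geometric_feat=[1.0, 0.0])
target = choose_frontier(memory, score=lambda n: {"c": 0.2, "d": 0.9}[n])
route = memory.path_to("b", target)
```

Because every frontier in the memory is a candidate, the agent in this sketch can also select "c", which is not adjacent to its current position "b", and reach it by backtracking through "a". That is the sense in which a structured memory supports global rather than purely local decisions.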
Abstract
Recently, numerous algorithms have been developed to tackle the problem of vision-language navigation (VLN), i.e., requiring an agent to navigate 3D environments by following linguistic instructions. However, current VLN agents simply store their past experiences/observations as latent states in recurrent networks, failing to capture environment layouts or to plan over the long term. To address these limitations, we propose a crucial architecture, called Structured Scene Memory (SSM). It is compartmentalized enough to accurately memorize the percepts during navigation. It also serves as a structured scene representation, which captures and disentangles visual and geometric cues in the environment. SSM has a collect-read controller that adaptively collects information to support current decision making and mimics iterative algorithms for long-range reasoning. As SSM provides a…
