Narrowing the Gap between Vision and Action in Navigation
Yue Zhang, Parisa Kordjamshidi

TL;DR
This paper introduces a low-level action decoder and semantic-aware waypoint predictor to bridge the gap between visual perception and low-level control in Vision and Language Navigation, improving navigation performance.
Contribution
It proposes a joint training approach for low-level actions and enhances waypoint prediction with semantic and obstacle information, addressing key limitations of existing VLN-CE methods.
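One way to picture the joint training idea is as a combined objective: the usual view-selection loss plus a weighted loss on the low-level action. The sketch below is illustrative only and is not taken from the paper. The function names, the `action_weight` value, and the four-action set (assumed here to follow the common VLN-CE convention) are all assumptions.

```python
import math

# Assumed low-level action set; the paper's exact action space may differ.
LOW_LEVEL_ACTIONS = ["FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]


def cross_entropy(logits, target):
    """Negative log-likelihood of index `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(view_logits, view_target, action_logits, action_target,
               action_weight=0.5):
    """Hypothetical joint objective: view selection + weighted action loss.

    `action_weight` is an illustrative hyperparameter, not a value from the paper.
    """
    l_view = cross_entropy(view_logits, view_target)
    l_action = cross_entropy(action_logits, action_target)
    return l_view + action_weight * l_action


if __name__ == "__main__":
    # Three candidate views (waypoints) and four low-level actions.
    view_logits = [2.0, 0.5, -1.0]
    action_logits = [1.5, 0.0, 0.0, -2.0]
    loss = joint_loss(view_logits, 0, action_logits,
                      LOW_LEVEL_ACTIONS.index("FORWARD"))
    print(f"joint loss: {loss:.4f}")
```

Under this assumed formulation, setting `action_weight` to zero recovers training on view selection alone, so the weight controls how strongly the low-level action signal shapes the shared representation.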
Findings
Improved navigation performance on benchmark datasets.
Better grounding of visual views to low-level controls.
Enhanced waypoint prediction with semantic and obstacle awareness.
Abstract
Existing methods for Vision and Language Navigation in Continuous Environments (VLN-CE) commonly incorporate a waypoint predictor to discretize the environment. This simplifies navigation into a view selection task and improves performance significantly compared to training directly on low-level actions. However, VLN-CE agents are still far from real robots because gaps remain between their visual perception and executed actions. First, VLN-CE agents that discretize the visual environment are primarily trained with high-level view selection, which causes them to ignore crucial spatial reasoning within low-level action movements. Second, the waypoint predictors in these models neglect object semantics and attributes related to passability, which can be informative in indicating the feasibility of actions. To address these two…
Taxonomy
Topics: Travel Writing and Literature · Historical Geography and Cartography
