Language and Visual Entity Relationship Graph for Agent Navigation
Yicong Hong, Cristian Rodriguez-Opazo, Yuankai Qi, Qi Wu, Stephen Gould

TL;DR
This paper introduces a novel graph-based approach that models relationships between language and visual entities to enhance agent navigation in real-world environments, significantly improving performance on benchmark datasets.
Contribution
It proposes a new Language and Visual Entity Relationship Graph and a message passing algorithm to better interpret complex instructions and environment perceptions in VLN tasks.
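As a rough illustration of what one round of message passing over such a graph could look like, here is a minimal sketch. It is not the paper's formulation: the three node names follow the entities named in the abstract (scene, object, direction), while the function `message_passing_step`, the per-edge language gates, and the additive update are all assumptions made for this example.

```python
def message_passing_step(nodes, lang_gates):
    """One illustrative round of message passing between visual entity nodes.

    nodes:      dict mapping a node name to its feature vector (list of floats).
    lang_gates: dict mapping (src, dst) to a gate vector; assumed here to be
                derived from the instruction text by some upstream model.
    Returns a new dict with each node updated by gated messages from the others.
    """
    updated = {}
    for dst, dst_feat in nodes.items():
        agg = list(dst_feat)  # start from the node's own features
        for src, src_feat in nodes.items():
            if src == dst:
                continue
            gate = lang_gates[(src, dst)]
            # Hypothetical update rule: element-wise gated sum of neighbour features.
            for i in range(len(agg)):
                agg[i] += gate[i] * src_feat[i]
        updated[dst] = agg
    return updated


# Toy usage with 2-dimensional features and uniform gates.
nodes = {"scene": [1.0, 0.0], "object": [0.0, 1.0], "direction": [1.0, 1.0]}
gates = {(s, d): [0.5, 0.5] for s in nodes for d in nodes if s != d}
result = message_passing_step(nodes, gates)
```

In a real agent the updated node features would then be combined to score candidate actions; that step is omitted here because the summary does not specify it.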
Findings
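Achieves a new state-of-the-art SPL (Success weighted by Path Length) of 52% on the R2R unseen split.
Improves SDTW (success weighted by normalized Dynamic Time Warping) from 13% to 34% on the R4R dataset.
Demonstrates that modelling relationships among entities improves navigation accuracy.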
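For readers unfamiliar with the SPL metric reported above, the sketch below shows how it is commonly computed: each successful episode contributes the ratio of the shortest-path length to the larger of the path actually taken and the shortest path, and failed episodes contribute zero. The function name `spl` and the tuple layout are assumptions for this example, not taken from the paper's code.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    episodes: list of (success, shortest_len, path_len) tuples, where
              success is a bool and both lengths are positive numbers.
    """
    total = 0.0
    for success, shortest_len, path_len in episodes:
        if success:
            # A longer-than-optimal path is penalised; a failure scores 0.
            total += shortest_len / max(path_len, shortest_len)
    return total / len(episodes)


# Toy usage: one optimal success, one success with a path twice as long, one failure.
example = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
score = spl(example)
```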
Abstract
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions. From both the textual and visual perspectives, we find that the relationships among the scene, its objects, and directional clues are essential for the agent to interpret complex instructions and correctly perceive the environment. To capture and utilize the relationships, we propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision, and the intra-modal relationships among visual entities. We propose a message passing algorithm for propagating information between language elements and visual entities in the graph, which we then combine to determine the next action to take. Experiments show that by taking advantage of the relationships we are able to improve over state-of-the-art. On…
Taxonomy
Topics: Multimodal Machine Learning Applications · Time Series Analysis and Forecasting · Topic Modeling
