Robust and Interpretable Grounding of Spatial References with Relation Networks
Tsung-Yen Yang, Andrew S. Lan, Karthik Narasimhan

TL;DR
This paper introduces a relation network model for understanding spatial references in natural language, enhancing robustness and interpretability in tasks like navigation and manipulation.
Contribution
It proposes a dynamic, text-conditioned relation network with cross-modal attention for explicit reasoning over spatial entities, improving robustness and interpretability.
Findings
17% improvement in goal location prediction
15% improvement in robustness over state-of-the-art systems
Effective in three diverse spatial understanding tasks
Abstract
Learning representations of spatial references in natural language is a key challenge in tasks like autonomous navigation and robotic manipulation. Recent work has investigated various neural architectures for learning multi-modal representations for spatial concepts. However, the lack of explicit reasoning over entities makes such approaches vulnerable to noise in input text or state observations. In this paper, we develop effective models for understanding spatial references in text that are robust and interpretable, without sacrificing performance. We design a text-conditioned relation network whose parameters are dynamically computed with a cross-modal attention module to capture fine-grained spatial relations between entities. This design choice provides interpretability of learned intermediate outputs. Experiments across three tasks demonstrate that our model achieves…
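The idea in the abstract of a relation function whose parameters are computed from the text via cross-modal attention can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the function name, the linear form of the relation function, the projection matrices W_q and W_gen, and all dimensions are assumptions chosen only for illustration. Each ordered entity pair attends over the instruction tokens, and the attended text vector generates the weights used to score that pair.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def text_conditioned_relation_scores(entities, tokens, W_q, W_gen):
    """Score every ordered entity pair with text-generated weights.

    Hypothetical shapes (illustrative assumptions, not from the paper):
      entities: (N, d_e)       one feature vector per entity
      tokens:   (T, d_t)       one embedding per instruction token
      W_q:      (2 * d_e, d_t) maps a pair feature to an attention query
      W_gen:    (d_t, 2 * d_e) maps attended text to relation weights
    Returns (scores, attn) with shapes (N, N) and (N, N, T).
    """
    n, d_e = entities.shape
    # Pair features: concatenate entity i with entity j -> (N, N, 2 * d_e).
    left = np.broadcast_to(entities[:, None, :], (n, n, d_e))
    right = np.broadcast_to(entities[None, :, :], (n, n, d_e))
    pair = np.concatenate([left, right], axis=-1)

    # Cross-modal attention: each pair queries the instruction tokens.
    query = pair @ W_q                                    # (N, N, d_t)
    logits = query @ tokens.T / np.sqrt(tokens.shape[1])  # (N, N, T)
    attn = softmax(logits, axis=-1)
    context = attn @ tokens                               # (N, N, d_t)

    # Dynamic parameters: the text context generates the weights of a
    # (here linear) relation function applied to the same pair.
    dyn_w = context @ W_gen                               # (N, N, 2 * d_e)
    scores = (dyn_w * pair).sum(axis=-1)                  # (N, N)
    return scores, attn


# Toy usage with random inputs; all sizes are arbitrary.
rng = np.random.default_rng(0)
entities = rng.normal(size=(4, 3))   # 4 entities, d_e = 3
tokens = rng.normal(size=(5, 8))     # 5 tokens, d_t = 8
W_q = rng.normal(size=(6, 8))
W_gen = rng.normal(size=(8, 6))
scores, attn = text_conditioned_relation_scores(entities, tokens, W_q, W_gen)
```

In a sketch like this, the attention tensor is the kind of intermediate output one could inspect: for each entity pair it shows which instruction tokens most influenced the generated relation weights.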
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Geographic Information Systems Studies
Methods: Interpretability
