SIRI: Spatial Relation Induced Network For Spatial Description Resolution
Peiyao Wang, Weixin Luo, Yanyu Xu, Haojie Li, Shugong Xu, Jianyu Yang, Shenghua Gao

TL;DR
This paper introduces SIRI, a novel network that models spatial relationships explicitly for language-guided localization in panoramic views, significantly improving accuracy over previous methods.
Contribution
The paper proposes a new spatial relationship induced network that mimics human spatial reasoning, incorporating object-level correlation, spatial relationship distillation, and global position priors.
Findings
Achieves 24% higher accuracy than the previous state-of-the-art method on the Touchdown dataset.
Effectively generalizes to an extended dataset with similar settings.
Improves spatial description resolution by explicit relationship modeling.
Abstract
Spatial Description Resolution, a language-guided localization task, aims to locate a target in a panoramic street view given a corresponding language description. Explicitly characterizing object-level relationships while distilling spatial relationships is currently absent but crucial to this task. Mimicking humans, who sequentially traverse spatial relationship words and objects from a first-person view to locate their target, we propose a novel spatial relationship induced (SIRI) network. Specifically, visual features are first correlated at an implicit object level in a projected latent space; they are then distilled by each spatial relationship word, yielding a differently activated feature for each spatial relationship. Further, we introduce global position priors to fix the absence of positional information, which may result in global positional…
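The three steps named in the abstract (object-level correlation in a projected latent space, distillation by each spatial relationship word, and global position priors) can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the softmax affinity, the sigmoid channel gate, the appended normalized (x, y) coordinates, and every function and parameter name below are assumptions chosen only to make the data flow concrete.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def correlate(features, proj):
    """Assumed correlation step: project (N, C) features to a latent space
    with proj (C, D), build an (N, N) affinity, and mix the features."""
    latent = features @ proj                 # (N, D)
    affinity = softmax(latent @ latent.T)    # each row sums to 1
    return affinity @ features               # (N, C)


def distill(features, word_embeddings, word_proj):
    """Assumed distillation step: each of K spatial-word embeddings (K, E)
    yields a sigmoid channel gate (K, C) that re-weights the features."""
    gates = 1.0 / (1.0 + np.exp(-(word_embeddings @ word_proj)))  # (K, C)
    return features[None, :, :] * gates[:, None, :]               # (K, N, C)


def add_position_prior(features, height, width):
    """Assumed position prior: append normalized (x, y) coordinates of each
    of the height * width locations as two extra channels."""
    ys, xs = np.meshgrid(
        np.linspace(0.0, 1.0, height),
        np.linspace(0.0, 1.0, width),
        indexing="ij",
    )
    coords = np.stack([xs.ravel(), ys.ravel()], axis=-1)          # (N, 2)
    coords = np.broadcast_to(coords, features.shape[:-1] + (2,))
    return np.concatenate([features, coords], axis=-1)


# Toy sizes, chosen arbitrarily for the illustration.
rng = np.random.default_rng(0)
H, W, C, D, K, E = 4, 6, 8, 5, 3, 7
feats = rng.normal(size=(H * W, C))
correlated = correlate(feats, rng.normal(size=(C, D)))
per_word = distill(correlated, rng.normal(size=(K, E)), rng.normal(size=(E, C)))
with_prior = add_position_prior(per_word, H, W)
print(with_prior.shape)  # (3, 24, 10): one feature map per spatial word
```

The sketch keeps one feature map per spatial relationship word, which mirrors the abstract's statement that each word produces a differently activated feature; how the paper combines these maps into a final location is not covered by the text above and is left out here.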
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
