"Where am I?" Scene Retrieval with Language
Jiaqi Chen, Daniel Barath, Iro Armeni, Marc Pollefeys, Hermann Blum

TL;DR
This paper introduces a method for retrieving scenes from natural language queries by matching text descriptions against 3D scene graphs, with the goal of supporting natural language interaction with embodied AI agents.
Contribution
It proposes Text2SceneGraphMatcher, a novel pipeline that learns joint embeddings for matching language descriptions with scene graphs for scene retrieval.
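To make the idea of retrieval in a joint embedding space concrete, here is a minimal sketch, not the authors' implementation. It assumes that some text encoder and some scene-graph encoder (neither shown) have already mapped the query and each candidate scene into vectors of the same dimension. The function names, the scene ids, and the toy vectors are hypothetical, and cosine similarity is an assumed scoring choice rather than one confirmed by the summary above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_scenes(query_embedding, scene_embeddings, top_k=1):
    """Rank candidate scenes by similarity to the query and return the top_k.

    query_embedding: vector for the text description (hypothetical encoder output).
    scene_embeddings: dict mapping scene id -> vector for that scene graph
                      (hypothetical encoder output).
    Returns a list of (scene_id, score) pairs, best match first.
    """
    scored = [
        (scene_id, cosine_similarity(query_embedding, embedding))
        for scene_id, embedding in scene_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy, hand-written vectors standing in for learned embeddings (assumption).
scene_embeddings = {
    "kitchen": [0.9, 0.1, 0.0],
    "office": [0.1, 0.9, 0.1],
    "bedroom": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # e.g. a description mentioning a fridge

ranking = retrieve_scenes(query_embedding, scene_embeddings, top_k=2)
for scene_id, score in ranking:
    print(f"{scene_id}: {score:.3f}")
```

With these toy vectors the kitchen scene ranks first and the office second. In a learned system the vectors would come from trained encoders, and the ranking quality would depend entirely on how well those encoders align text and scene graphs.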
Findings
Successful matching of language queries to 3D scene graphs
Code and models will be publicly available
Advances natural language interface for embodied AI
Abstract
Natural language interfaces to embodied AI are becoming more ubiquitous in our daily lives. This opens up further opportunities for language-based interaction with embodied agents, such as a user verbally instructing an agent to execute some task in a specific location. For example, "put the bowls back in the cupboard next to the fridge" or "meet me at the intersection under the red sign." As such, we need methods that interface between natural language and map representations of the environment. To this end, we explore the question of whether we can use an open-set natural language query to identify a scene represented by a 3D scene graph. We define this task as "language-based scene-retrieval" and it is closely related to "coarse-localization," but we are instead searching for a match from a collection of disjoint scenes and not necessarily a large-scale continuous map. We present…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
