"Where am I?" Scene Retrieval with Language

Jiaqi Chen; Daniel Barath; Iro Armeni; Marc Pollefeys; Hermann Blum

arXiv:2404.14565 · cs.CV · November 11, 2024

TL;DR

This paper introduces a method for scene retrieval using natural language queries by matching text descriptions with 3D scene graphs, enabling better natural language interaction with embodied AI agents.

Contribution

It proposes Text2SceneGraphMatcher, a novel pipeline that learns joint embeddings for matching language descriptions with scene graphs for scene retrieval.
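
To make the retrieval step concrete, the following is a minimal sketch of ranking scenes in a joint embedding space. It assumes that some text encoder and some scene-graph encoder have already mapped the query and each scene into vectors of the same dimension; the vectors, the scene names, and the `retrieve_scenes` helper are hypothetical and are not taken from Text2SceneGraphMatcher. Cosine similarity is used here as one common scoring choice, not necessarily the one the paper uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_scenes(text_embedding, scene_embeddings, top_k=1):
    """Rank scenes by similarity to the query embedding (hypothetical helper).

    scene_embeddings maps a scene id to its embedding vector.
    Returns the top_k (scene_id, score) pairs, best match first.
    """
    scored = [
        (scene_id, cosine_similarity(text_embedding, embedding))
        for scene_id, embedding in scene_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy, hand-written vectors standing in for learned embeddings (assumption).
scene_embeddings = {
    "kitchen": [0.9, 0.1, 0.0],
    "office": [0.1, 0.9, 0.1],
    "bedroom": [0.0, 0.2, 0.9],
}
# Stand-in for the embedding of a query such as "a room with a fridge".
query_embedding = [0.8, 0.2, 0.1]

ranking = retrieve_scenes(query_embedding, scene_embeddings, top_k=3)
for scene_id, score in ranking:
    print(f"{scene_id}: {score:.3f}")
```

With these toy vectors the query ranks "kitchen" first, then "office", then "bedroom". In a learned pipeline the two encoders would be trained so that matching text and scene graph pairs land close together, and the ranking step would stay the same.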

Findings

01. Successful matching of language queries to 3D scene graphs

02. Code and models will be publicly available

03. Advances natural language interface for embodied AI

Abstract

Natural language interfaces to embodied AI are becoming more ubiquitous in our daily lives. This opens up further opportunities for language-based interaction with embodied agents, such as a user verbally instructing an agent to execute some task in a specific location. For example, "put the bowls back in the cupboard next to the fridge" or "meet me at the intersection under the red sign." As such, we need methods that interface between natural language and map representations of the environment. To this end, we explore the question of whether we can use an open-set natural language query to identify a scene represented by a 3D scene graph. We define this task as "language-based scene-retrieval" and it is closely related to "coarse-localization," but we are instead searching for a match from a collection of disjoint scenes and not necessarily a large-scale continuous map. We present…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques