Compositional Semantics for Open Vocabulary Spatio-semantic Representations

Robin Karlsson; Francisco Lepe-Salazar; Kazuya Takeda

arXiv:2310.04981·cs.CV·October 10, 2023

TL;DR

This paper introduces a novel approach for representing complex spatio-semantic information in mobile robots using latent compositional semantic embeddings, enabling better reasoning and memory retrieval beyond immediate perception.

Contribution

The authors propose a mathematically grounded method for learning and discovering compositional semantic embeddings that improve open-vocabulary spatio-semantic reasoning in vision-language models.

Findings

01. z* embeddings can represent up to 10 semantics with SBERT and 100 in ideal conditions.

02. A simple VLM trained on COCO-Stuff learns z* for 181 semantics with 42.23 mIoU.

03. Improved open-vocabulary segmentation performance by +3.48 mIoU over SOTA.

Abstract

General-purpose mobile robots need to complete tasks without exact human instructions. Large language models (LLMs) are a promising direction for realizing commonsense world knowledge and reasoning-based planning. Vision-language models (VLMs) transform environment percepts into vision-language semantics interpretable by LLMs. However, completing complex tasks often requires reasoning about information beyond what is currently perceived. We propose latent compositional semantic embeddings z* as a principled learning-based knowledge representation for queryable spatio-semantic memories. We mathematically prove that z* can always be found, and that the optimal z* is the centroid for any set Z. We derive a probabilistic bound for estimating the separability of related and unrelated semantics. We prove that z* is discoverable by iterative optimization by gradient descent from visual appearance and…
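The abstract's two claims, that the optimal z* is the centroid of a set Z and that it can be reached by gradient descent, can be illustrated with a small numerical sketch. This is not the authors' code: the objective used here (mean squared Euclidean distance to the members of Z), the random unit vectors standing in for semantic embeddings, and the names `centroid` and `descend_to_z_star` are all assumptions made for illustration. The paper's actual loss and embedding model may differ.

```python
import numpy as np


def centroid(Z):
    """Closed-form candidate for z*: the mean of the embeddings in Z."""
    return Z.mean(axis=0)


def descend_to_z_star(Z, lr=0.1, steps=500, seed=0):
    """Gradient descent on the ASSUMED objective mean_i ||z - z_i||^2.

    Hypothetical helper; the paper's real objective may differ.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=Z.shape[1])  # arbitrary starting point
    for _ in range(steps):
        # Gradient of mean_i ||z - z_i||^2 with respect to z.
        grad = 2.0 * (z - Z).mean(axis=0)
        z = z - lr * grad
    return z


# Stand-in for a set Z of 10 semantic embeddings of dimension 16.
rng = np.random.default_rng(42)
Z = rng.normal(size=(10, 16))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)  # unit-normalize each row

z_closed = centroid(Z)
z_gd = descend_to_z_star(Z)

# The iterative solution should agree with the centroid.
print("max abs difference:", np.abs(z_gd - z_closed).max())
```

Under this assumed objective the iterate contracts toward the centroid at every step, so the gradient-descent result and the closed-form mean agree to numerical precision regardless of the starting point.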

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Machine Learning in Bioinformatics

Methods: Contrastive Language-Image Pre-training · Sentence-BERT