Compositional Semantics for Open Vocabulary Spatio-semantic Representations
Robin Karlsson, Francisco Lepe-Salazar, Kazuya Takeda

TL;DR
This paper introduces a novel approach for representing complex spatio-semantic information in mobile robots using latent compositional semantic embeddings, enabling better reasoning and memory retrieval beyond immediate perception.
Contribution
The authors propose a mathematically grounded method for learning and discovering compositional semantic embeddings that improve open-vocabulary spatio-semantic reasoning in vision-language models.
Findings
z* embeddings can represent up to 10 semantics with SBERT and 100 in ideal conditions.
A simple VLM trained on COCO-Stuff learns z* for 181 semantics with 42.23 mIoU.
Open-vocabulary segmentation performance improves by +3.48 mIoU over SOTA.
Abstract
General-purpose mobile robots need to complete tasks without exact human instructions. Large language models (LLMs) are a promising direction for realizing commonsense world knowledge and reasoning-based planning. Vision-language models (VLMs) transform environment percepts into vision-language semantics interpretable by LLMs. However, completing complex tasks often requires reasoning about information beyond what is currently perceived. We propose latent compositional semantic embeddings z* as a principled learning-based knowledge representation for queryable spatio-semantic memories. We mathematically prove that z* can always be found, and the optimal z* is the centroid for any set Z. We derive a probabilistic bound for estimating separability of related and unrelated semantics. We prove that z* is discoverable by iterative optimization by gradient descent from visual appearance and…
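The centroid claim in the abstract can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not taken from the paper: embeddings are unit-normalized, "optimal" is read as maximizing the mean cosine similarity to the set Z, and Z is a synthetic cluster of random vectors rather than CLIP or SBERT outputs. All function names (`centroid_embedding`, `gradient_ascent_embedding`, `make_related_set`) are hypothetical helpers for illustration, and the paper's actual objective and training setup may differ.

```python
import numpy as np


def normalize(v):
    """Scale vectors to unit L2 norm along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def mean_cosine(z, Z):
    """Mean cosine similarity between one vector z and every row of Z."""
    return float((normalize(Z) @ normalize(z)).mean())


def centroid_embedding(Z):
    """Closed-form candidate for z*: centroid of the unit-normalized rows of Z,
    rescaled to unit norm (assumed objective: mean cosine similarity)."""
    return normalize(normalize(Z).mean(axis=0))


def gradient_ascent_embedding(Z, steps=500, lr=0.1, seed=0):
    """Search for z* iteratively from a random start.

    Projected gradient ascent on mean(Zn @ z) with z constrained to the unit
    sphere. This stands in for the paper's gradient-based discovery; the
    objective here is an assumption, not the authors' loss.
    """
    Zn = normalize(Z)
    rng = np.random.default_rng(seed)
    z = normalize(rng.normal(size=Zn.shape[1]))
    grad = Zn.mean(axis=0)  # gradient of mean(Zn @ z) w.r.t. z is constant
    for _ in range(steps):
        z = normalize(z + lr * grad)  # ascent step, then project to unit norm
    return z


def make_related_set(n=10, dim=128, noise=0.05, seed=1):
    """Synthetic stand-in for a set Z of related semantics: noisy copies of
    one random direction (hypothetical data, not real text embeddings)."""
    rng = np.random.default_rng(seed)
    base = normalize(rng.normal(size=dim))
    return normalize(base + noise * rng.normal(size=(n, dim)))


if __name__ == "__main__":
    Z = make_related_set()
    z_closed = centroid_embedding(Z)
    z_iter = gradient_ascent_embedding(Z)
    unrelated = normalize(np.random.default_rng(2).normal(size=(10, Z.shape[1])))
    print("cosine(closed form, iterative):", float(z_closed @ z_iter))
    print("mean cosine to related set:   ", mean_cosine(z_closed, Z))
    print("mean cosine to unrelated set: ", mean_cosine(z_closed, unrelated))
```

Under these assumptions the iterative search and the closed-form centroid point in the same direction, and the resulting vector scores higher against the members of Z than against unrelated random vectors, which is the kind of separability a query over a spatio-semantic memory would rely on.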
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Machine Learning in Bioinformatics
Methods: Contrastive Language-Image Pre-training · Sentence-BERT
