GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology

Shivendra Agrawal; Bradley Hayes

arXiv:2604.15495·cs.AI·April 20, 2026

TL;DR

GIST is a multimodal knowledge extraction system that creates a semantic spatial topology from mobile point clouds, enabling advanced navigation and interaction tasks in complex environments.

Contribution

GIST introduces a pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology, improving spatial understanding and human-AI interaction in cluttered spaces.

Findings

01  Achieved a 1.04 m top-5 mean translation error in localizing semantic zones.

02  Outperformed sequence-based instruction generation baselines in LLM evaluations.

03  Reached an 80% navigation success rate using verbal cues in real-world tests.

Abstract

Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection.…
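The first stage named in the abstract, distilling a point cloud into a 2D occupancy map, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `occupancy_grid`, the cell size, and the height band (`z_min`, `z_max`) are all assumptions chosen for illustration, and the sketch assumes the cloud is already gravity-aligned with z pointing up.

```python
import numpy as np

def occupancy_grid(points, cell_size=0.1, z_min=0.1, z_max=2.0):
    """Project an (N, 3) point cloud onto the XY plane as a boolean grid.

    Hypothetical helper: parameters and thresholds are illustrative
    assumptions, not values taken from the paper.
    """
    points = np.asarray(points, dtype=float)

    # Keep only points inside an assumed height band, so that floor and
    # ceiling returns do not mark cells as occupied.
    in_band = (points[:, 2] >= z_min) & (points[:, 2] <= z_max)
    xy = points[in_band, :2]

    # The grid covers the XY bounding box of the full cloud.
    origin = points[:, :2].min(axis=0)
    extent = points[:, :2].max(axis=0) - origin
    shape = np.floor(extent / cell_size).astype(int) + 1

    # Mark every cell that contains at least one in-band point.
    grid = np.zeros(shape, dtype=bool)
    idx = np.floor((xy - origin) / cell_size).astype(int)
    grid[idx[:, 0], idx[:, 1]] = True
    return grid, origin

if __name__ == "__main__":
    cloud = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.5, 0.5, 3.0], [0.25, 0.0, 0.5]]
    grid, origin = occupancy_grid(cloud, cell_size=0.5)
    print(grid.astype(int))
```

A grid like this is the kind of input from which a topological layout (for example, a graph of free-space corridors) could then be extracted; how GIST performs that extraction and attaches its semantic layer is not specified in the abstract above.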
