GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology
Shivendra Agrawal, Bradley Hayes

TL;DR
GIST is a multimodal knowledge extraction system that creates a semantic spatial topology from mobile point clouds, enabling advanced navigation and interaction tasks in complex environments.
Contribution
It introduces a novel pipeline that transforms point clouds into a semantic topology, improving spatial understanding and human-AI interaction in cluttered spaces.
Findings
Achieved a 1.04 m top-5 mean translation error in localizing semantic zones.
Outperformed sequence-based instruction generation baselines in LLM evaluations.
Reached an 80% navigation success rate using verbal cues in real-world tests.
Abstract
Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection.…
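As a rough illustration of the first stage named in the abstract (distilling a point cloud into a 2D occupancy map), the sketch below bins 3D points that fall inside a height band onto an XY grid. It is not the authors' implementation: the function name `occupancy_grid`, the cell size, the height band, and the `min_hits` threshold are all assumptions chosen for the example.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float, float]


def occupancy_grid(
    points: Iterable[Point],
    cell_size: float = 0.1,   # assumed grid resolution in metres
    z_min: float = 0.1,       # assumed lower height cutoff (drops floor points)
    z_max: float = 1.8,       # assumed upper height cutoff (drops ceiling points)
    min_hits: int = 1,        # assumed number of points needed to mark a cell
) -> Tuple[List[List[int]], Tuple[float, float]]:
    """Project points within [z_min, z_max] onto the XY plane.

    Returns (grid, origin): grid[row][col] is 1 for occupied cells and 0
    otherwise; origin is the (x, y) coordinate of the corner of cell [0][0].
    """
    # Keep only points in the height band that could block movement.
    obstacles = [(x, y) for x, y, z in points if z_min <= z <= z_max]
    if not obstacles:
        return [], (0.0, 0.0)

    x0 = min(x for x, _ in obstacles)
    y0 = min(y for _, y in obstacles)
    cols = int((max(x for x, _ in obstacles) - x0) // cell_size) + 1
    rows = int((max(y for _, y in obstacles) - y0) // cell_size) + 1

    # Count how many points land in each cell.
    counts = [[0] * cols for _ in range(rows)]
    for x, y in obstacles:
        col = int((x - x0) // cell_size)
        row = int((y - y0) // cell_size)
        counts[row][col] += 1

    # Threshold the counts into a binary occupancy map.
    grid = [[1 if n >= min_hits else 0 for n in row] for row in counts]
    return grid, (x0, y0)
```

A map like this could then feed the later stages the abstract describes, such as topology extraction and semantic annotation, which are not sketched here.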
