Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR
Lixing Guo, Tobias Höllerer

TL;DR
This paper introduces a modular AR agent system that combines large language models with grounded vision models to interpret complex natural language queries for spatial retrieval and scene understanding in augmented reality environments.
Contribution
It presents a novel modular architecture integrating MLLMs and perception tools for relational reasoning and spatial retrieval, supporting plug-and-play models without retraining.
Findings
Enables understanding of object relations and interactions in 3D space.
Supports complex language queries for spatial localization and reasoning.
Provides an evaluation framework for real-world spatial grounding tasks.
Abstract
Traditional augmented reality (AR) systems predominantly rely on fixed class detectors or fiducial markers, limiting their ability to interpret complex, open-vocabulary natural language queries. We present a modular AR agent system that integrates multimodal large language models (MLLMs) with grounded vision models to enable relational reasoning in space and language-conditioned spatial retrieval in physical environments. Our adaptive task agent coordinates MLLMs and coordinate-aware perception tools to address varying query complexities, ranging from simple object identification to multi-object relational reasoning, while returning meter-accurate 3D anchors. It constructs dynamic AR scene graphs encoding nine typed relations (spatial, structural-semantic, causal-functional), enabling MLLMs to understand not just what objects exist, but how they relate and interact in 3D space. Through…
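As a rough illustration of the scene-graph idea in the abstract, the Python sketch below stores objects with 3D anchors in meters and typed relations between them. Only the three relation categories come from the abstract. The class names, the `near` predicate, the 0.5 m threshold, and the example objects are all hypothetical, and none of this should be read as the authors' implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import math


class RelationCategory(Enum):
    # The three relation categories named in the abstract.
    SPATIAL = "spatial"
    STRUCTURAL_SEMANTIC = "structural-semantic"
    CAUSAL_FUNCTIONAL = "causal-functional"


@dataclass(frozen=True)
class SceneObject:
    name: str
    anchor: tuple  # (x, y, z) in meters; the world frame is an assumption


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str  # hypothetical label such as "near"; not taken from the paper
    obj: str
    category: RelationCategory


@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_object(self, obj: SceneObject) -> None:
        self.objects[obj.name] = obj

    def add_near_relations(self, max_dist_m: float = 0.5) -> None:
        """Add a SPATIAL 'near' edge for every pair of anchors within max_dist_m."""
        names = sorted(self.objects)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                d = math.dist(self.objects[a].anchor, self.objects[b].anchor)
                if d <= max_dist_m:
                    self.relations.append(
                        Relation(a, "near", b, RelationCategory.SPATIAL)
                    )

    def related_to(self, name: str,
                   category: Optional[RelationCategory] = None) -> list:
        """Return names of objects sharing an edge with `name`, optionally by category."""
        out = []
        for r in self.relations:
            if category is not None and r.category is not category:
                continue
            if r.subject == name:
                out.append(r.obj)
            elif r.obj == name:
                out.append(r.subject)
        return out


# Example scene with made-up objects and positions.
graph = SceneGraph()
graph.add_object(SceneObject("mug", (0.0, 0.0, 0.0)))
graph.add_object(SceneObject("laptop", (0.3, 0.0, 0.0)))
graph.add_object(SceneObject("lamp", (2.0, 0.0, 0.0)))
graph.add_near_relations(max_dist_m=0.5)
neighbors = graph.related_to("mug", RelationCategory.SPATIAL)
```

A graph like this could be serialized to text and passed to an MLLM as context for a relational query such as "what is next to the mug", with the stored anchors used to place the answer in 3D. How the actual system does this is not specified in the text above.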
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Robotics and Sensor-Based Localization
