Words into World: A Task-Adaptive Agent for Language-Guided Spatial Retrieval in AR

Lixing Guo; Tobias Höllerer

arXiv:2512.00294·cs.CV·December 2, 2025

Open Access

TL;DR

This paper introduces a modular AR agent system that combines multimodal large language models (MLLMs) with grounded vision models to interpret complex natural language queries for spatial retrieval and scene understanding in augmented reality environments.

Contribution

The paper presents a modular architecture that integrates MLLMs with perception tools for relational reasoning and spatial retrieval. Models can be swapped in plug-and-play fashion without retraining.

Findings

01. Enables understanding of object relations and interactions in 3D space.

02. Supports complex language queries for spatial localization and reasoning.

03. Provides an evaluation framework for real-world spatial grounding tasks.

Abstract

Traditional augmented reality (AR) systems predominantly rely on fixed class detectors or fiducial markers, limiting their ability to interpret complex, open-vocabulary natural language queries. We present a modular AR agent system that integrates multimodal large language models (MLLMs) with grounded vision models to enable relational reasoning in space and language-conditioned spatial retrieval in physical environments. Our adaptive task agent coordinates MLLMs and coordinate-aware perception tools to address varying query complexities, ranging from simple object identification to multi-object relational reasoning, while returning meter-accurate 3D anchors. It constructs dynamic AR scene graphs encoding nine typed relations (spatial, structural-semantic, causal-functional), enabling MLLMs to understand not just what objects exist, but how they relate and interact in 3D space. Through…
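
The scene-graph idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. Only the three relation categories (spatial, structural-semantic, causal-functional) come from the abstract; the class names, the `on_top_of` predicate, the object names, and the coordinate values are hypothetical, and the nine concrete relation types are not listed here, so none are assumed.

```python
from dataclasses import dataclass, field
from enum import Enum


class RelationCategory(Enum):
    # The three relation categories named in the abstract.
    SPATIAL = "spatial"
    STRUCTURAL_SEMANTIC = "structural-semantic"
    CAUSAL_FUNCTIONAL = "causal-functional"


@dataclass(frozen=True)
class Anchor:
    """A 3D anchor in meters (the coordinate frame is an assumption)."""
    x: float
    y: float
    z: float


@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str  # hypothetical label, e.g. "on_top_of"
    obj: str
    category: RelationCategory


@dataclass
class SceneGraph:
    anchors: dict = field(default_factory=dict)    # object name -> Anchor
    relations: list = field(default_factory=list)  # typed edges

    def add_object(self, name: str, anchor: Anchor) -> None:
        self.anchors[name] = anchor

    def relate(self, subject: str, predicate: str, obj: str,
               category: RelationCategory) -> None:
        # Both endpoints must already be anchored in the scene.
        if subject not in self.anchors or obj not in self.anchors:
            raise KeyError("both objects must be added before relating them")
        self.relations.append(Relation(subject, predicate, obj, category))

    def query(self, predicate: str, obj: str) -> list:
        """Return (name, anchor) for every subject linked to `obj` by `predicate`."""
        return [
            (r.subject, self.anchors[r.subject])
            for r in self.relations
            if r.predicate == predicate and r.obj == obj
        ]


# Hypothetical scene: two objects resting on a desk.
graph = SceneGraph()
graph.add_object("desk", Anchor(0.40, 0.70, -1.05))
graph.add_object("mug", Anchor(0.42, 0.76, -1.10))
graph.add_object("lamp", Anchor(0.10, 0.95, -1.00))
graph.relate("mug", "on_top_of", "desk", RelationCategory.SPATIAL)
graph.relate("lamp", "on_top_of", "desk", RelationCategory.SPATIAL)

result = graph.query("on_top_of", "desk")
```

In this sketch, a query such as "what is on the desk" reduces to a lookup over typed edges that returns each matching object together with its 3D anchor. In the described system an MLLM would presumably translate the natural language query into such a lookup; that step is omitted here.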

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Robotics and Sensor-Based Localization