RieMind: Geometry-Grounded Spatial Agent for Scene Understanding

Fernando Ropero; Erkin Turkoz; Daniel Matos; Junqing Du; Antonio Ruiz; Yanfeng Zhang; Lu Liu; Mingwei Sun; Yongliang Wang

arXiv:2603.15386 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper introduces RieMind, a geometry-grounded spatial agent that decouples perception from reasoning for indoor scene understanding. It outperforms previous methods by up to 16% without fine-tuning.

Contribution

The paper proposes an agentic framework that grounds a large language model (LLM) in an explicit 3D scene graph, isolating reasoning from perception to improve spatial reasoning.

Findings

01. A 3D scene graph built from ground-truth annotations significantly improves reasoning performance.

02. The approach outperforms previous methods by up to 16% without fine-tuning.

03. The agentic variant performs 33% to 50% better than the base VLMs.

Abstract

Visual Language Models (VLMs) have increasingly become the main paradigm for understanding indoor scenes, but they still struggle with metric and spatial reasoning. Current approaches rely on end-to-end video understanding or large-scale spatial question answering fine-tuning, inherently coupling perception and reasoning. In this paper, we investigate whether decoupling perception and reasoning leads to improved spatial reasoning. We propose an agentic framework for static 3D indoor scene reasoning that grounds an LLM in an explicit 3D scene graph (3DSG). Rather than ingesting videos directly, each scene is represented as a persistent 3DSG constructed by a dedicated perception module. To isolate reasoning performance, we instantiate the 3DSG from ground-truth annotations. The agent interacts with the scene exclusively through structured geometric tools that expose fundamental properties…
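The tool-mediated interaction described in the abstract can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `SceneObject` and `SceneGraph` classes, the three tool names, and the axis-aligned box representation are hypothetical and are not taken from the paper, whose actual tool set is cut off in the truncated abstract above. The sketch only shows the general idea of an agent that never sees pixels and instead queries a persistent scene graph through a small, whitelisted set of geometric functions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical scene-graph node: an axis-aligned box in metres."""
    name: str
    center: tuple[float, float, float]
    size: tuple[float, float, float]  # extent along x, y, z


class SceneGraph:
    """Hypothetical persistent scene representation queried only via tools."""

    def __init__(self, objects):
        self._objects = {obj.name: obj for obj in objects}

    def list_objects(self):
        """Return the names of all objects in the scene, sorted."""
        return sorted(self._objects)

    def get_size(self, name):
        """Return the (x, y, z) extent of one object."""
        return self._objects[name].size

    def center_distance(self, a, b):
        """Return the Euclidean distance between two object centers."""
        return math.dist(self._objects[a].center, self._objects[b].center)


# Only these method names are exposed to the reasoning model (an assumption).
ALLOWED_TOOLS = {"list_objects", "get_size", "center_distance"}


def call_tool(graph, tool, *args):
    """Dispatch a tool call requested by the reasoning model."""
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {tool}")
    return getattr(graph, tool)(*args)


# Toy scene standing in for a graph built from ground-truth annotations.
scene = SceneGraph([
    SceneObject("sofa", (0.0, 0.0, 0.0), (2.0, 0.9, 0.8)),
    SceneObject("table", (3.0, 4.0, 0.0), (1.2, 0.8, 0.75)),
])
```

In this toy setup, a question such as "how far is the sofa from the table?" would be answered by the model issuing `call_tool(scene, "center_distance", "sofa", "table")` and reading back a metric value, rather than estimating the distance from video frames.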

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation