Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
Jiacheng Hua, Yishu Yin, Yuhang Wu, Tai Wang, Yifei Huang, Miao Liu

TL;DR
This paper introduces TRACE, a prompting method that helps multimodal large language models reason more accurately about 3D space in videos by first generating a text-based representation of the environment.
Contribution
The paper proposes TRACE, a prompting technique that improves 3D spatial reasoning in MLLMs by inducing them to produce a structured, text-based representation of the environment as an intermediate reasoning trace.
Findings
TRACE improves spatial question answering accuracy across multiple MLLM backbones.
Experiments on VSI-Bench and OST-Bench show consistent gains over previous prompting strategies.
Ablation studies validate the effectiveness of each component of TRACE.
Abstract
Existing Multimodal Large Language Models (MLLMs) struggle with 3D spatial reasoning, as they fail to construct structured abstractions of the 3D environment depicted in video inputs. To bridge this gap, drawing inspiration from cognitive theories of allocentric spatial reasoning, we investigate how to enable MLLMs to model and reason over text-based spatial representations of video. Specifically, we introduce Textual Representation of Allocentric Context from Egocentric Video (TRACE), a prompting method that induces MLLMs to generate text-based representations of 3D environments as intermediate reasoning traces for more accurate spatial question answering. TRACE encodes meta-context, camera trajectories, and detailed object entities to support structured spatial reasoning over egocentric videos. Extensive experiments on VSI-Bench and OST-Bench demonstrate that TRACE yields notable and…
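To make the idea of a text-based spatial representation concrete, here is a minimal Python sketch of what such an intermediate trace could look like. It is an illustration under stated assumptions, not the paper's actual format: the abstract only names the three components (meta-context, camera trajectory, object entities), so every class name, field name, coordinate convention, and the prompt wording below are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical convention: (x, y) in metres, in a single room-aligned frame.
Point = Tuple[float, float]


@dataclass
class CameraStep:
    t: float            # seconds from the start of the video
    position: Point
    heading_deg: float  # assumed convention: 0 = +y axis, clockwise positive


@dataclass
class ObjectEntity:
    name: str
    position: Point


@dataclass
class SceneTrace:
    """Illustrative container for the three components named in the abstract."""
    meta_context: str
    trajectory: List[CameraStep] = field(default_factory=list)
    objects: List[ObjectEntity] = field(default_factory=list)

    def to_text(self) -> str:
        """Serialize the scene as plain text a model could emit or read back."""
        lines = [f"meta_context: {self.meta_context}", "camera_trajectory:"]
        for s in self.trajectory:
            x, y = s.position
            lines.append(
                f"  - t={s.t:.1f}s pos=({x:.1f}, {y:.1f}) heading={s.heading_deg:.0f}deg"
            )
        lines.append("objects:")
        for o in self.objects:
            x, y = o.position
            lines.append(f"  - {o.name} pos=({x:.1f}, {y:.1f})")
        return "\n".join(lines)

    def distance(self, a: str, b: str) -> float:
        """Euclidean distance between two named objects in the shared frame."""
        by_name = {o.name: o.position for o in self.objects}
        (ax, ay), (bx, by) = by_name[a], by_name[b]
        return math.hypot(bx - ax, by - ay)


def build_prompt(question: str) -> str:
    """Hypothetical two-stage instruction: write the representation, then answer."""
    return (
        "First describe the scene as text with three parts: meta_context, "
        "camera_trajectory, objects. Then use that description to answer.\n"
        f"Question: {question}"
    )


# Made-up example scene, for illustration only.
trace = SceneTrace(
    meta_context="indoor living room, single walkthrough",
    trajectory=[CameraStep(0.0, (0.0, 0.0), 0.0), CameraStep(2.5, (1.0, 2.0), 90.0)],
    objects=[ObjectEntity("sofa", (0.0, 0.0)), ObjectEntity("table", (3.0, 4.0))],
)
print(trace.to_text())
print(trace.distance("sofa", "table"))
```

The point of the sketch is the design choice the abstract describes: once objects and camera poses are written down in one shared (allocentric) frame, a question such as "how far is the sofa from the table" reduces to a lookup and a simple computation over text, rather than requiring the model to integrate viewpoint changes across frames.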
