EgoMind: Activating Spatial Cognition through Linguistic Reasoning in MLLMs

Zhenghao Chen; Huiqun Wang; Di Huang

arXiv:2604.03318·cs.CV·April 7, 2026

TL;DR

EgoMind is a Chain-of-Thought framework that improves spatial reasoning in multimodal large language models without 3D priors or geometric supervision. It builds a linguistic scene graph across frames and then reasons step by step toward the task-specific question.

Contribution

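The paper presents EgoMind, a Chain-of-Thought approach to geometry-free spatial reasoning with two components: Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which reasons progressively toward task-specific questions.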

Findings

01 Achieves competitive results on multiple spatial reasoning benchmarks.

02 Uses only 5K auto-generated supervised fine-tuning samples.

03 Demonstrates the effectiveness of linguistic reasoning for spatial cognition.

Abstract

Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL…
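To make the idea of a linguistic scene graph built across frames more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Relation` and `LinguisticSceneGraph` classes, the `build_graph` helper, and the example objects are all hypothetical names chosen for illustration, and the per-frame relations are hard-coded rather than produced by a model. The sketch only shows how textual spatial statements from several frames might be merged into one geometry-free description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One spatial statement in words, e.g. ("sofa", "left of", "table")."""
    subject: str
    predicate: str
    obj: str


@dataclass
class LinguisticSceneGraph:
    """Hypothetical container: objects and relations stored purely as text."""
    objects: set = field(default_factory=set)
    relations: list = field(default_factory=list)

    def add_frame(self, frame_relations):
        """Merge the relations described for one frame, skipping duplicates."""
        for rel in frame_relations:
            self.objects.update((rel.subject, rel.obj))
            if rel not in self.relations:
                self.relations.append(rel)

    def describe(self):
        """Render the graph as sentences a language model could reason over."""
        return [f"The {r.subject} is {r.predicate} the {r.obj}." for r in self.relations]


def build_graph(frames):
    """Fold per-frame relation lists into a single cross-frame graph."""
    graph = LinguisticSceneGraph()
    for frame_relations in frames:
        graph.add_frame(frame_relations)
    return graph


# Illustrative input: two frames that partly overlap (assumed data, not from the paper).
frames = [
    [Relation("sofa", "left of", "table")],
    [Relation("sofa", "left of", "table"), Relation("lamp", "behind", "sofa")],
]
graph = build_graph(frames)
print(graph.describe())
```

In this sketch the repeated sofa/table relation from the second frame is stored only once, so the final description covers both frames without duplication. A question-answering step would then work over the sentences returned by `describe()` rather than over 3D coordinates.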

Code & Models

Repositories

Hyggge/EgoMind (GitHub)

Models

Hyggge/EgoMind-7B (Hugging Face model, 28 downloads, 2 likes)
