TL;DR
EgoMind is a linguistic reasoning framework that improves spatial cognition in multimodal large language models without relying on geometric priors, using linguistic scene graphs and progressive analysis.
Contribution
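The paper presents EgoMind, a chain-of-thought framework for geometry-free spatial reasoning. It has two components: Role-Play Caption, which builds a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which reasons step by step toward the task-specific question.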
Findings
Achieves competitive results on multiple spatial reasoning benchmarks.
Uses only 5K auto-generated supervised fine-tuning samples.
Demonstrates the effectiveness of linguistic reasoning for spatial cognition.
Abstract
Multimodal large language models (MLLMs) are increasingly being applied to spatial cognition tasks, where they are expected to understand and interact with complex environments. Most existing works improve spatial reasoning by introducing 3D priors or geometric supervision, which enhances performance but incurs substantial data preparation and alignment costs. In contrast, purely 2D approaches often struggle with multi-frame spatial reasoning due to their limited ability to capture cross-frame spatial relationships. To address these limitations, we propose EgoMind, a Chain-of-Thought framework that enables geometry-free spatial reasoning through Role-Play Caption, which jointly constructs a coherent linguistic scene graph across frames, and Progressive Spatial Analysis, which progressively reasons toward task-specific questions. With only 5K auto-generated SFT samples and 20K RL…
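To make the idea of a cross-frame linguistic scene graph more concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's implementation: the names `Relation` and `LinguisticSceneGraph`, the triple format, and the example frames are all hypothetical, and the abstract does not specify how Role-Play Caption represents or merges relations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One spatial statement in plain language (hypothetical format)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class LinguisticSceneGraph:
    """Accumulates textual spatial relations observed across frames."""
    relations: list = field(default_factory=list)

    def add_frame(self, frame_relations):
        # Merge one frame's relations, skipping exact duplicates so that
        # an object pair seen in several frames is recorded only once.
        for rel in frame_relations:
            if rel not in self.relations:
                self.relations.append(rel)

    def objects(self):
        # Unique object names in order of first appearance.
        names = []
        for rel in self.relations:
            for name in (rel.subject, rel.obj):
                if name not in names:
                    names.append(name)
        return names

    def describe(self):
        # Render the graph as text that a language model could reason over.
        return " ".join(
            f"The {r.subject} is {r.predicate} the {r.obj}."
            for r in self.relations
        )


# Hypothetical per-frame relations, as a captioning step might produce them.
frame_1 = [Relation("sofa", "left of", "table")]
frame_2 = [
    Relation("table", "in front of", "window"),
    Relation("sofa", "left of", "table"),  # repeated from frame 1
]

graph = LinguisticSceneGraph()
graph.add_frame(frame_1)
graph.add_frame(frame_2)
scene_text = graph.describe()
print(scene_text)
```

The point of the sketch is that the shared object ("table") links the two frames in text alone, with no 3D coordinates or depth input. A later reasoning step could then work from `scene_text` toward a question such as where the sofa is relative to the window.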
