Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models
Kunyu Peng, Zhikun Zhou, Kailun Yang, Di Wen, Ruiping Liu, Yufan Chen, Junwei Zheng, Hao Shi, Yi Zhou, M. Saquib Sarfraz, Danda Pani Paudel, Luc Van Gool

TL;DR
This paper introduces CoopSR, a benchmark, and EgoTeam, a dataset, for multi-robot cooperative egocentric spatial reasoning with multimodal large language models, and proposes SP-CoR, a framework that improves reasoning accuracy and generalization.
Contribution
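The paper presents CoopSR, the first benchmark for multi-robot cooperative dynamic spatial reasoning, and EgoTeam, a multi-robot egocentric QA dataset. It also proposes SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), a framework that improves reasoning accuracy and generalization.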
Findings
SP-CoR outperforms baselines by +3.87% on Habitat and +7.12% on iGibson.
The EgoTeam dataset includes 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson.
SP-CoR generalizes well to unseen team sizes and real-world robot tests.
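To make the dataset dimensions and the team-size comparison concrete, the sketch below shows one way a multi-robot QA record could be represented and scored per team size. Everything in it is an assumption for illustration: the class name `CoopQARecord`, its field names, the 1 to 4 difficulty encoding, and the exact-match scoring rule are hypothetical and are not taken from the EgoTeam release or the CoopSR evaluation protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class CoopQARecord:
    """Hypothetical record layout; field names are assumptions, not the EgoTeam schema."""
    simulator: str                 # e.g. "habitat" or "igibson"
    team_size: int                 # number of robots contributing a video
    question_type: str             # label for the kind of question asked
    difficulty: int                # difficulty tier, assumed here to be 1..4
    video_paths: Tuple[str, ...]   # one synchronized egocentric clip per robot
    question: str
    answer: str

    def __post_init__(self) -> None:
        # Each robot in the team is assumed to contribute exactly one clip.
        if len(self.video_paths) != self.team_size:
            raise ValueError("expected one egocentric video per robot")
        if not 1 <= self.difficulty <= 4:
            raise ValueError("difficulty tier must be between 1 and 4")


def accuracy_by_team_size(
    records: Sequence[CoopQARecord], predictions: Sequence[str]
) -> Dict[int, float]:
    """Exact-match accuracy grouped by team size (the scoring rule is an assumption)."""
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for record, predicted in zip(records, predictions):
        total[record.team_size] += 1
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if predicted.strip().lower() == record.answer.strip().lower():
            correct[record.team_size] += 1
    return {size: correct[size] / total[size] for size in total}
```

Grouping by `team_size` is one simple way to check whether accuracy holds up on team sizes held out of training; the benchmark's own metrics may differ.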
Abstract
Multimodal Large Language Models (MLLMs) have made substantial progress in egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. We study this problem through multi-robot cooperative dynamic spatial reasoning, where a model must answer spatial, temporal, visibility, and coordination questions by integrating synchronized egocentric videos from a team of moving robots. To support this setting, we introduce CoopSR, the first benchmark for this task, together with EgoTeam, a multi-robot egocentric QA dataset. EgoTeam contains 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson, along with a real-world test set of around 2,326 QAs collected using two quadruped robots. We further propose SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), an…
