Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models

Kunyu Peng; Zhikun Zhou; Kailun Yang; Di Wen; Ruiping Liu; Yufan Chen; Junwei Zheng; Hao Shi; Yi Zhou; M. Saquib Sarfraz; Danda Pani Paudel; Luc Van Gool

arXiv:2605.18431 · cs.CV · May 20, 2026

TL;DR

This paper introduces CoopSR, a benchmark for multi-robot cooperative egocentric spatial reasoning with multimodal large language models, together with the EgoTeam dataset, and reports improved reasoning accuracy and generalization from the proposed SP-CoR framework.

Contribution

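It presents CoopSR, the first benchmark for multi-robot cooperative dynamic spatial reasoning, and EgoTeam, a multi-robot egocentric QA dataset, and proposes SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), a framework that improves reasoning accuracy and generalization.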

Findings

1. SP-CoR outperforms baselines by +3.87% on Habitat and +7.12% on iGibson.

2. EgoTeam contains 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes.

3. SP-CoR generalizes well to unseen team sizes and to a real-world test set collected with two quadruped robots.
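
EgoTeam is described as spanning 19 question types, four difficulty tiers, and three team sizes, with one synchronized egocentric video per robot. A minimal sketch of how such a record and a per-team-size accuracy breakdown might look is given here. The class name `CoopQARecord`, every field name, the exact-match scoring rule, and the example values are hypothetical and are not taken from the released dataset or code.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CoopQARecord:
    """Hypothetical layout for one multi-robot QA pair (not the released schema)."""
    question: str
    answer: str
    question_type: str      # placeholder label; the real 19 type names are not assumed here
    difficulty_tier: int    # 1..4, mirroring the four tiers mentioned in the abstract
    environment: str        # e.g. "habitat", "igibson", or "real" (assumed labels)
    video_paths: List[str]  # one synchronized egocentric clip per robot

    @property
    def team_size(self) -> int:
        # Team size is inferred from the number of egocentric streams.
        return len(self.video_paths)


def accuracy_by_team_size(
    records: List[CoopQARecord], predictions: List[str]
) -> Dict[int, float]:
    """Return {team_size: exact-match accuracy}, ignoring case and outer whitespace."""
    correct: Counter = Counter()
    total: Counter = Counter()
    for rec, pred in zip(records, predictions):
        total[rec.team_size] += 1
        if pred.strip().lower() == rec.answer.strip().lower():
            correct[rec.team_size] += 1
    return {size: correct[size] / total[size] for size in total}
```

Grouping scores by team size in this way is one simple route to checking generalization to team sizes that were held out during training; the paper's own evaluation protocol may differ.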

Abstract

Multimodal Large Language Models (MLLMs) have made substantial progress in egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. We study this problem through multi-robot cooperative dynamic spatial reasoning, where a model must answer spatial, temporal, visibility, and coordination questions by integrating synchronized egocentric videos from a team of moving robots. To support this setting, we introduce CoopSR, the first benchmark for this task, together with EgoTeam, a multi-robot egocentric QA dataset. EgoTeam contains 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson, along with a real-world test set of around 2,326 QAs collected using two quadruped robots. We further propose SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), an…

Code & Models

Repositories

KPeng9510/seeing-together.git (GitHub)
