Ego to World: Collaborative Spatial Reasoning in Embodied Systems via Reinforcement Learning
Heng Zhou, Li Kang, Yiran Qin, Xiufeng Song, Ao Yu, Zilu Zhang, Haoming Song, Kaixin Xu, Yuchen Fan, Dongzhan Zhou, Xiaohong Liu, Ruimao Zhang, Philip Torr, Lei Bai, Zhenfei Yin

TL;DR
This paper introduces the Ego-to-World benchmark and a novel CoRL framework that combines reasoning and reinforcement learning to improve multi-agent spatial understanding and manipulation from partial views.
Contribution
The paper presents the E2W benchmark for evaluating multi-view spatial reasoning and introduces CoRL, a two-stage method that enhances reasoning accuracy through cross-view rewards and generalizes to real-world multi-robot tasks.
Findings
CoRL outperforms existing baselines on E2W reasoning tasks.
The approach generalizes well to external spatial reasoning benchmarks.
CoRL enables effective real-world multi-robot manipulation with multi-camera rigs.
Abstract
Understanding the world from distributed, partial viewpoints is a fundamental challenge for embodied multi-agent systems. Each agent perceives the environment through an ego-centric view that is often limited by occlusion and ambiguity. To study this problem, we introduce the Ego-to-World (E2W) benchmark, which evaluates a vision-language model's ability to fuse heterogeneous viewpoints across three tasks: (i) global counting, (ii) relational location reasoning, and (iii) action-oriented grasping that requires predicting view-specific image coordinates. To address this setting, we propose CoRL, a two-stage framework that combines Chain-of-Thought supervised fine-tuning with reinforcement learning using Group-Relative Policy Optimization. Its core component, the Cross-View Spatial Reward (CVSR), provides dense task-aligned feedback by linking reasoning steps to visual evidence, ensuring…
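For readers unfamiliar with Group-Relative Policy Optimization, the sketch below shows only the group-relative advantage step that the method is named for: several responses are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. This is a minimal illustration, not the authors' implementation. The function name `group_relative_advantages`, the `eps` constant, and the example reward values are assumptions, and the scalar rewards merely stand in for whatever the CVSR produces, since its formula is not given in the text above. The population standard deviation is used here; implementations differ on this choice.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (GRPO-style).

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one multi-view prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses scoring above the group mean receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.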
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
1. The paper defines the task of collaborative spatial reasoning under distributed ego-centric observations, which is practically crucial because many real-world multi-robot and multi-camera systems inherently operate under partial, viewpoint-specific observations, yet suitable methods for such scenarios are underexplored in embodied AI. 2. The proposed goal of transforming fragmented ego-centric views into a globally coherent and semantically consistent scene representation is novel and elegant. Suc
1. Although E2W covers diverse tasks, the visual data shown in the paper is relatively clean and uncluttered. The static tabletop setups shown for the real-world setting and the simulation scene in Figure 2.c contain sparse objects with few occlusions and little background noise. This simplicity may underrepresent the challenges encountered in real-world embodied settings; more complex and visually noisy scenes would make the benchmark more representative of the embodied AI scenarios that the paper aims to
1. The paper is easy to follow. 2. The proposed GRPO reward guides the fine-tuning of a VLM toward better spatial reasoning, specifically counting, location reasoning, and affordance prediction.
1. The proposed method uses a single VLM to handle multiple views, which contradicts the claim that the paper addresses "collaborative" spatial reasoning. 2. Given the above point, the paper essentially proposes a multi-image spatial reasoning task and method, and the proposed tasks seem easier than existing benchmarks such as MindCube [1]. 3. The proposed method simply uses an existing framework, i.e. SFT+GRPO, which has already been proved to be useful
- Originality: Introduces a key robotics problem—multi-agent, multi-view embodied spatial reasoning—and presents the novel E2W benchmark and the Cross-View Spatial Reward (CVSR). - Quality: The E2W benchmark is well defined, practical, and accurately captures multi-view spatial reasoning. The CVSR is straightforward and efficient. - Clarity: The problem formulation, method overview, training objective, and reward design are clearly specified, with equations and figures, within a transparent pipe
- The real-world subset includes only single-image, single-agent samples; multi-view and multi-agent settings are available only in simulation. - The real-world evaluation setup is overly simple, with a black cloth background and uniform lighting. VLMs should also be tested in more in-the-wild, cluttered scenes with varied lighting and occlusion to assess robustness.
- This paper introduces the E2W benchmark to rigorously evaluate multi-view spatial reasoning in VLMs across global counting, relational localization, and action-oriented grasping with view-specific coordinates. - The CoRL framework and reward design are intuitive, and the paper demonstrates consistent gains over strong proprietary and open-source baselines on reasoning and perception-grounding metrics; ablations substantiate the necessity of each CVSR component.
- *Problem motivation and setting rationale:* The paper does not convincingly justify when multi-view, per-agent ego-centric fusion is preferable to constructing a single global view, especially in real-world multi-robot systems where calibrated multi-camera rigs are available. While the authors show that global-view inputs under token downsampling perform worse, this seems like a resource-allocation artifact rather than a fundamental limitation. Please clarify the operational regimes where the
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
