COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence

Zefeng Zhang; Xiangzhao Hao; Hengzhu Tang; Zhenyu Zhang; Jiawei Sheng; Xiaodong Li; Zhenyang Li; Li Gao; Daiting Shi; Dawei Yin; Tingwen Liu

arXiv:2512.04563 · cs.CV · December 8, 2025

Open Access · 2 Models · 1 Dataset

TL;DR

COOPER is a unified multimodal large language model (MLLM) that combines perception and reasoning to improve 3D-aware spatial understanding, with a reported 6.91% improvement in spatial reasoning.

Contribution

This work introduces COOPER, a unified MLLM that uses depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning.

Findings

1. COOPER achieves a 6.91% improvement in spatial reasoning.
2. A variant trained only for auxiliary modality generation gains 7.92% in distance and size estimation.
3. Learning auxiliary modalities enhances internal spatial knowledge.

Abstract

Visual Spatial Reasoning is crucial for enabling Multimodal Large Language Models (MLLMs) to understand object properties and spatial relationships, yet current models still struggle with 3D-aware reasoning. Existing approaches typically enhance either perception, by augmenting RGB inputs with auxiliary modalities such as depth and segmentation, or reasoning, by training on spatial VQA datasets and applying reinforcement learning, and thus treat these two aspects in isolation. In this work, we investigate whether a unified MLLM can develop an intrinsic ability to enhance spatial perception and, through adaptive interleaved reasoning, achieve stronger spatial intelligence. We propose COOPER, a unified MLLM that leverages depth and segmentation as auxiliary modalities and is trained in two stages to acquire auxiliary modality generation and adaptive, interleaved reasoning…

Code & Models

Models

Datasets

Starrrrrry/COOPER_Train_Set
dataset · 705 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization