CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Siliang Tang, Jun Xiao, Yueting Zhuang

arXiv:2605.18621 · cs.CV · May 19, 2026

TL;DR

This paper introduces CrossView Suite, which combines a dataset (CrossViewSet), a benchmark (CrossViewBench), and a model (CrossViewer) to improve how multimodal large language models (MLLMs) understand and reason across multiple viewpoints.

Contribution

The paper contributes a large-scale cross-view dataset, a systematic benchmark, and an explicit alignment framework to advance multi-view spatial reasoning in MLLMs.

Findings

01

CrossViewSet, a cross-view instruction dataset of 1.6M samples spanning 17 fine-grained task types, provides large-scale training data for cross-view reasoning.

02

CrossViewBench provides a systematic benchmark for evaluating cross-view reasoning.

03

Explicit alignment enhances cross-view reasoning capabilities.

Abstract

Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we develop CrossView Suite across three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. First, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we meticulously…

Code & Models

Repositories

Thinkirin/Crossview-Suite (GitHub)