TL;DR
This paper introduces CrossView Suite, a comprehensive framework for enhancing multimodal large language models' ability to understand and reason across multiple viewpoints through new datasets, benchmarks, and alignment methods.
Contribution
It develops a large-scale cross-view dataset, a systematic benchmark, and an explicit alignment framework to advance multi-view spatial reasoning in MLLMs.
Findings
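The large-scale cross-view dataset, CrossViewSet (1.6M samples across 17 fine-grained task types), addresses the scarcity of well-annotated training data.
The systematic benchmark enables comprehensive evaluation of cross-view reasoning.
Explicit alignment establishes object-level consistency across views, which enhances cross-view reasoning.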
Abstract
Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale, well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we develop CrossView Suite, which comprises three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. First, we introduce a multi-agent data engine to curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we meticulously…
