On the Generalization Capacities of MLLMs for Spatial Intelligence

Gongjie Zhang; Wenhao Li; Quanhao Qian; Jiuniu Wang; Deli Zhao; Shijian Lu; Ran Xu

arXiv:2603.06704·cs.CV·March 10, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces a camera-aware framework for multimodal large language models to improve their spatial reasoning and generalization across different camera setups by disentangling camera parameters from scene understanding.

Contribution

The paper proposes a novel camera-aware MLLM framework that incorporates camera intrinsics, data augmentation, and geometric priors to enhance spatial reasoning and generalization.
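One way to picture camera-aware augmentation is a synthetic zoom: center-crop the image, resize it back to the original resolution, and update the intrinsics to match. The sketch below is an illustration under the standard pinhole model, not the paper's actual augmentation pipeline. The function name `zoom_intrinsics` and the example values are hypothetical, and half-pixel offset conventions are ignored.

```python
import numpy as np

def zoom_intrinsics(K, width, height, zoom):
    """Intrinsics after a center crop by `zoom` (>1 zooms in), resized back to (width, height).

    Hypothetical helper for illustration; assumes a pinhole camera with no distortion.
    """
    crop_w, crop_h = width / zoom, height / zoom
    x0, y0 = (width - crop_w) / 2, (height - crop_h) / 2  # top-left corner of the crop
    K_new = K.astype(float).copy()
    # Cropping shifts the principal point; resizing scales it by `zoom`.
    K_new[0, 2] = (K[0, 2] - x0) * zoom
    K_new[1, 2] = (K[1, 2] - y0) * zoom
    # Resizing the crop back to full resolution scales both focal lengths.
    K_new[0, 0] = K[0, 0] * zoom
    K_new[1, 1] = K[1, 1] * zoom
    return K_new

# Example values (assumed): a 640x480 camera with a centered principal point.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
K_zoom = zoom_intrinsics(K, width=640, height=480, zoom=2.0)
print(K_zoom)  # focal lengths double; the centered principal point is unchanged
```

The image content changes exactly as if the focal length had doubled, so a model that never sees the intrinsics cannot tell a zoomed view from a closer or larger object.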

Findings

01

Camera-aware MLLMs outperform naive models in cross-camera tests.

02

Disentangling camera parameters improves spatial reasoning.

03

Camera-awareness is essential for robust spatial intelligence.

Abstract

Multimodal Large Language Models (MLLMs) that directly process RGB inputs for tasks like 3D localization and navigation have shown remarkable potential. However, we argue that these RGB-only approaches are fundamentally flawed in their ability to generalize across cameras. By ignoring camera parameters, they entangle an object's physical properties with the camera's perspective, creating an irresolvable ambiguity. We show this leads MLLMs to overfit to the training camera distribution, rather than learning true and generalizable 3D geometric principles. To address this, we propose a Camera-Aware MLLM framework for spatial MLLMs. It learns generalizable spatial reasoning by: (i) injecting camera intrinsics via a dense embedding that conditions each visual token; (ii) introducing a camera-aware data augmentation strategy that synthetically varies camera parameters, forcing the model to…
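A dense, intrinsics-derived signal per visual token can be pictured as one viewing-ray direction per image patch. The sketch below computes such rays from pinhole intrinsics. It is a minimal illustration, not the paper's embedding: the function name `patch_ray_directions`, the patch size of 14, and the example intrinsics are assumptions, and how the rays would be projected into the token space is left out.

```python
import numpy as np

def patch_ray_directions(fx, fy, cx, cy, width, height, patch=14):
    """Unit ray direction (camera frame) through the center of each image patch.

    Hypothetical helper for illustration; assumes a pinhole camera with no distortion.
    """
    # Pixel coordinates of the patch centers.
    us = np.arange(patch / 2, width, patch)
    vs = np.arange(patch / 2, height, patch)
    u, v = np.meshgrid(us, vs)  # each of shape (rows, cols)
    # Back-project through the pinhole model: direction is proportional to K^{-1} [u, v, 1]^T.
    dirs = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=-1)
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

# Example values (assumed): a 224x224 input split into 14x14 patches.
rays = patch_ray_directions(fx=500.0, fy=500.0, cx=112.0, cy=112.0, width=224, height=224)
print(rays.shape)  # one 3-vector per patch
```

Because the rays depend on the focal lengths and principal point, the same pixels captured by two different cameras would receive different conditioning, which is the property the abstract describes.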

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The writing is clear, and the logical flow is easy to follow.
- The benchmark coverage is extensive.
- Incorporating ray embedding, camera-aware augmentation, and distillation from 3D models is a reasonable and well-motivated approach.

Weaknesses

- Figure 1 may be misleading. The caption states, “There is no way to know unless I know the camera intrinsics!” However, even with camera intrinsics, it is still not analytically possible to determine the exact 3D location from a single RGB image. It would be helpful to clarify that priors are still needed to estimate the 3D position.
- The performance reported in Tables 2 and 3 is comparable to that of other spatial MLLMs, which makes the paper’s claims less convincing.
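The first point can be made concrete with the standard pinhole model. The notation below (focal lengths f_x, f_y, principal point (c_x, c_y), camera-frame point (X, Y, Z), pixel (u, v), object height H, image height h) is assumed for illustration and is not taken from the paper. It is a sketch of the geometry, not the authors' formulation.

```latex
% Pinhole projection of a 3D point (X, Y, Z) in the camera frame, Z > 0:
u = f_x \frac{X}{Z} + c_x, \qquad v = f_y \frac{Y}{Z} + c_y

% Unknown intrinsics: an object of physical height H, parallel to the image
% plane at depth Z, has image height
h = \frac{f_y H}{Z}
% so (f_y, Z) and (2 f_y, 2 Z) give the same h: focal length and depth are entangled.

% Known intrinsics: the pixel (u, v) back-projects to a ray, with depth Z still free:
\begin{pmatrix} X \\ Y \\ Z \end{pmatrix}
= Z \begin{pmatrix} (u - c_x)/f_x \\ (v - c_y)/f_y \\ 1 \end{pmatrix}
```

Knowing the intrinsics removes the focal-length ambiguity but leaves Z undetermined, and (H, Z) and (2H, 2Z) still give the same h. This is why a learned prior over object sizes or scene depth is still needed to place the point along the ray.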

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- The paper is well-written and easy to follow
- The idea in general makes sense and the results look good
- The experiment section is well designed and answers most research questions raised by the paper

Weaknesses

No major weaknesses per se, just some comments regarding the writing and presentation of the work:

1. About Table 1: This table nicely shows that prior VLMs do not benefit from multiple dataset training and are susceptible to zooming-in/out operations. I was expecting the paper to show at the end that this problem is now resolved with the proposed technique. The paper does show it, but it wasn’t straightforward to make this connection. Specifically, Table-1, Figure-6 and Table-5 are connected

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper demonstrates that camera-agnostic MLLMs fail to generalize across different camera parameters, highlighting a critical but overlooked limitation in current spatial reasoning approaches.
2. The proposed framework combines three complementary components to resolve the identified ambiguity, with experimental results showing improvements in cross-camera generalization on multiple spatially-grounded tasks.

Weaknesses

1. Table 5 does not include "Prior Distillation only" or "Geom Aug + Prior Dist (w/o Ray Emb)", making it hard to isolate the contribution of camera ray embeddings. Given that UniDepth v2 is pretrained on 10M+ RGB-depth pairs, the geometric prior distillation likely contributes the majority of improvements, but this is never quantified.
2. Outdated and incomplete baselines:
   * Table 2/3/4 use Gemini-1.5-Flash/Pro (Feb 2024) instead of Gemini-2.5 or later versions
   * Table 4 is missing Qwen2.5-VL-

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Spatial Cognition and Navigation