The Spatial Blindspot of Vision-Language Models
Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, Patrick Liu, Timothy Chung, Drishti Sharma, Akshata A, Kranthi Kiran, Wesley Tam, Bala Krishna S Vegesna

TL;DR
This paper identifies a spatial reasoning blindspot in current vision-language models caused by flattening images into 1D patches, and explores architectural modifications to enhance spatial understanding.
Contribution
It investigates image encoders trained with alternative objectives, together with 2D positional encodings, as ways to improve spatial reasoning in vision-language models.
Findings
Architectural changes can improve results on several spatial reasoning benchmarks
Alternative objectives enhance spatial awareness
2D positional encodings contribute to better spatial grounding
Abstract
Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.
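The two ideas in the abstract, flattening patches into a 1D sequence and restoring 2D structure through positional encoding, can be illustrated with a small sketch. The code below is a minimal, standard-library-only illustration of an axial 2D rotary positional embedding. It is an assumption about one common way to build such an encoding, not the authors' implementation, and every function name here (`patch_grid_positions`, `rope_1d`, `rope_2d`, `dot`) is hypothetical.

```python
import math


def patch_grid_positions(height, width):
    """Map each index of a row-major flattened patch sequence back to (row, col)."""
    return [(i // width, i % width) for i in range(height * width)]


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of vec by angles proportional to pos."""
    d = len(vec)
    assert d % 2 == 0, "channel count must be even"
    out = []
    for k in range(d // 2):
        theta = pos * base ** (-2.0 * k / d)  # per-pair rotation frequency
        x, y = vec[2 * k], vec[2 * k + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Axial 2D variant (an assumption): first half encodes row, second half col."""
    d = len(vec)
    assert d % 4 == 0, "channel count must be divisible by 4"
    half = d // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    """Plain dot product, standing in for an attention logit."""
    return sum(x * y for x, y in zip(a, b))


# In a 2x4 grid, flattened indices 3 and 4 are adjacent in 1D
# but sit at opposite ends of different rows in 2D.
positions = patch_grid_positions(2, 4)
q = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9]
k = [1.0, 0.3, -0.4, 0.8, -1.2, 0.6, 0.7, -0.2]

# Both pairs share the same (row, col) offset, so their scores should match.
score_a = dot(rope_2d(q, 1, 2), rope_2d(k, 3, 5))
score_b = dot(rope_2d(q, 0, 0), rope_2d(k, 2, 3))
```

Because each channel pair is rotated by an angle proportional to the row or column index, the score between a query and a key depends only on their row and column offsets, so `score_a` and `score_b` agree up to floating-point error. A purely 1D index would instead treat patches 3 and 4 of the 2x4 grid as neighbours even though they lie in different rows.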
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper focuses on an important and often underexplored aspect of Vision-Language Models (VLMs): spatial awareness. Improving spatial reasoning is crucial for grounding and physical understanding in multimodal models.
- The authors attempt to address this by introducing 2D rotary positional embeddings within the LLaVA framework and by evaluating a few alternative image encoders (e.g., SigLIP, SigLIP2, AIMv2) to study their impact on spatial reasoning performance.
1. **Overstated Novelty: Prior Work Already Explores 2D or Multi-Dimensional Positional Embeddings** Prior work has already integrated 2D (or multi-dimensional) positional embeddings in vision or vision-language models, so the paper’s claim that this area is “under-explored” is inaccurate. Qwen2-VL introduced Multimodal Rotary Positional Embedding (M-RoPE), which decomposes positional embeddings into three parts capturing 1D text, 2D visual (height/width), and 3D video temporal information [1]
1. The paper clearly isolates architectural sources of spatial reasoning failure in VLMs, shifting attention from data or fine-tuning toward the underlying design of image encoders and positional embeddings.
2. By varying only the encoder and positional encoding within an identical LLaVA pipeline, the study provides a controlled and fair comparison that strengthens the validity of its empirical observations.
3. The evaluation spans multiple spatial reasoning and general multimodal benchmarks,
1. The paper identifies the causes of spatial blindness but does not propose or evaluate a real fix beyond minor encoder swaps and a simple 2D-RoPE variant, limiting its contribution to diagnosis rather than advancement.
2. The conclusion that dense or autoregressive pre-training improves spatial reasoning is based solely on benchmark trends, without any feature-level analysis, visualisation, or causal verification.
3. The encoders compared differ not only in pre-training objectives but also s
1. Important Problem: The paper tackles a critical and widely acknowledged weakness in VLMs, spatial reasoning, which is a known bottleneck for applications like Embodied AI.
2. Systematic Design: The core design of the study, comparing different encoder objectives and positional encodings within the controlled LLaVA framework, is systematic and a sound scientific approach.
3. Comprehensive Benchmarking: The authors evaluate performance on a large (7+) and diverse suite of specialized spatial benc
1. Scattered Logic and Misaligned Experimental Focus: The paper's core scientific contribution should be the intra-LLaVA ablation study. However, the experimental analysis is completely unfocused. The narrative erratically jumps between this controlled study and extensive "SOTA-comparisons" against frontier models like Qwen2.5-VL. The authors later dismiss these comparisons as not "apples-to-apples," which makes their inclusion distracting and confusing. This scattered focus makes the paper's co
* I commend the authors on the range of encoder architectures evaluated and the decision to include variants with and without the 2D positional encoding.
* The paper is generally well written and easy to follow, with clear motivations and thoughtful error analyses. I found the targeted comparisons between specific model pairs, and the discussion of why the results might arise, helpful and a point in favor of the paper.
* I think the introduction provides a particularly strong overview of th
* While I overall enjoyed reading the paper, I think the mixed nature of the results makes it difficult for me to synthesize a key takeaway. In particular, I don’t feel that this sentence from the Conclusion is justified: `Overall, the findings highlight that encoder design strongly shapes spatial awareness within VLM families`. While there were differences in overall performance between models, the differences among the fine-tuned models were quite muted (with some notable exceptions that the autho
The paper presents a strong and well-motivated problem statement, and the authors provide extensive experiments and ablations demonstrating that the choice of vision encoder and the design of positional encoding play an important role in improving spatial reasoning in vision-language models.
The main weakness of the paper lies in its limited novelty. Both proposed directions, introducing 2D rotary positional embeddings and replacing the CLIP encoder with stronger vision backbones such as SigLIP or AIMv2, have already been explored in prior work. Specifically, Qwen2-VL has introduced M-RoPE for preserving spatial structure, and Cambrian-1 has systematically studied the effect of different vision encoders, including SigLIP, on multimodal reasoning. As a result, the paper’s contributio
Taxonomy
Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Language, Metaphor, and Cognition
