Can LLMs See Without Pixels? Benchmarking Spatial Intelligence from Textual Descriptions

Zhongbin Guo; Zhen Yang; Yushan Li; Xinyue Zhang; Wenyu Gao; Jiacheng Wang; Chengzhi Li; Xiangrui Liu; Ping Jian

arXiv:2601.03590 · cs.CV · January 8, 2026

TL;DR

This paper introduces SiT-Bench, a benchmark that evaluates the spatial intelligence of Large Language Models from textual descriptions alone. Without visual input, models handle localized semantic tasks well but show a gap in global spatial consistency.

Contribution

The paper presents SiT-Bench, a comprehensive textual benchmark for assessing LLMs' spatial reasoning, highlighting the importance of explicit reasoning and providing a new resource for future research.

Findings

01 LLMs excel in localized semantic tasks

02 A significant gap exists in global spatial consistency

03 Explicit spatial reasoning improves LLM performance

Abstract

Recent advancements in Spatial Intelligence (SI) have predominantly relied on Vision-Language Models (VLMs), yet a critical question remains: does spatial understanding originate from visual encoders or the fundamental reasoning backbone? Inspired by this question, we introduce SiT-Bench, a novel benchmark designed to evaluate the SI performance of Large Language Models (LLMs) without pixel-level input. It comprises over 3,800 expert-annotated items across five primary categories and 17 subtasks, ranging from egocentric navigation and perspective transformation to fine-grained robotic manipulation. By converting single/multi-view scenes into high-fidelity, coordinate-aware textual descriptions, we challenge LLMs to perform symbolic textual reasoning rather than visual pattern matching. Evaluation results of state-of-the-art (SOTA) LLMs reveal that while models achieve proficiency in…
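
As a rough illustration of what a "coordinate-aware textual description" could look like, the sketch below renders a toy scene as text and derives a ground-truth answer for a left/right question. The `SceneObject` record, the sentence template, and the question wording are all assumptions made for illustration; they are not the SiT-Bench schema or the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical object record; NOT the SiT-Bench schema."""
    name: str
    x: float  # metres to the right of the observer (negative = left)
    y: float  # metres in front of the observer


def describe_scene(objects):
    """Render each object as one coordinate-aware sentence."""
    lines = [
        f"The {o.name} is at (x={o.x:.1f}, y={o.y:.1f}) metres relative to the observer."
        for o in objects
    ]
    return "\n".join(lines)


def left_or_right(a, b):
    """Ground truth for 'Is a to the left or right of b?' from the observer's view."""
    if a.x == b.x:
        return "aligned"
    return "left" if a.x < b.x else "right"


# Toy scene with made-up coordinates.
scene = [SceneObject("chair", -1.0, 2.0), SceneObject("table", 0.5, 3.0)]

# Text-only prompt an LLM would receive; no pixels involved.
prompt = describe_scene(scene) + "\nQuestion: Is the chair to the left or right of the table?"

# Reference answer computed symbolically from the coordinates.
answer = left_or_right(scene[0], scene[1])
```

A model's reply to `prompt` could then be compared against `answer`; a real benchmark item would presumably involve richer scenes and expert-written annotations than this toy example.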

Taxonomy

Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Language and cultural evolution