SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks

Zijian Song; Xiaoxin Lin; Qiuming Huang; Sihan Qin; Guangrun Wang; Liang Lin

arXiv:2506.14512·cs.CV·April 15, 2026

TL;DR

SIRI-Bench is a benchmark of 9,000 video-question-answer triplets, each embedded in a realistic 3D scene, designed to evaluate and challenge VLMs' spatial reasoning abilities.

Contribution

The paper introduces SIRI-Bench, a comprehensive benchmark for assessing VLMs' structural spatial intelligence through complex reasoning tasks.

Findings

01. State-of-the-art VLMs perform poorly on SIRI-Bench.
02. The benchmark reveals significant gaps in current models' spatial reasoning.
03. An Automatic Scene Creation Engine was developed for data synthesis.

Abstract

Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic study of their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' structural spatial intelligence through spatial-grounded reasoning tasks. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems…
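To make the video-question-answer triplet structure concrete, here is a minimal sketch of how one record and a scoring pass might look. Everything in it is an assumption made for illustration: the field names (`video_path`, `question`, `answer`), the idea that answers are numeric, and the 5% relative tolerance are hypothetical and are not taken from the paper or its released data format.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Triplet:
    """One benchmark item. Field names are hypothetical, not the paper's schema."""
    video_path: str  # assumed: path to the video of the 3D scene
    question: str    # the problem posed about the scene
    answer: float    # assumed: a numeric ground-truth answer


def is_correct(predicted: float, expected: float, rel_tol: float = 0.05) -> bool:
    # math.isclose is symmetric: the tolerance is relative to the larger magnitude.
    return math.isclose(predicted, expected, rel_tol=rel_tol)


def accuracy(triplets: Sequence[Triplet], predictions: Sequence[float],
             rel_tol: float = 0.05) -> float:
    """Fraction of predictions within the assumed relative tolerance."""
    if len(triplets) != len(predictions):
        raise ValueError("need exactly one prediction per triplet")
    if not triplets:
        return 0.0
    hits = sum(is_correct(p, t.answer, rel_tol)
               for t, p in zip(triplets, predictions))
    return hits / len(triplets)
```

A model's answers would be passed as `predictions`, in the same order as the triplets. A tolerance-based check is only one possible scoring rule; the benchmark's actual metric may differ.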

Code & Models

Models

remyxai/SpaceQwen2.5-VL-3B-Instruct (model; 897 downloads, 18 likes)
