TL;DR
SIRI-Bench is a new benchmark of 9,000 video-question-answer triplets, each embedded in a realistic 3D scene, designed to evaluate and challenge VLMs' spatial reasoning abilities.
Contribution
The paper introduces SIRI-Bench, a comprehensive benchmark for assessing VLMs' structural spatial intelligence through complex reasoning tasks.
Findings
State-of-the-art VLMs perform poorly on SIRI-Bench.
The benchmark reveals significant gaps in current models' spatial reasoning.
An Automatic Scene Creation Engine, built on collaborative LLM agents, was developed to synthesize the data at scale.
Abstract
Large Language Models (LLMs) have undergone rapid progress, largely attributed to reinforcement learning on complex reasoning tasks. In contrast, while spatial intelligence is fundamental for Vision-Language Models (VLMs) in real-world interaction, the systematic study of their complex spatial reasoning remains underexplored. To bridge this gap, we introduce SIRI-Bench, a benchmark designed to evaluate VLMs' structural spatial intelligence through spatial-grounded reasoning tasks. SIRI-Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene. The benchmark is carefully designed so that solving each problem requires both spatial comprehension and structural reasoning. To facilitate large-scale data synthesis, we develop an Automatic Scene Creation Engine that employs collaborative LLM agents to translate abstract mathematical problems…
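To make the "video-question-answer triplet" structure concrete, here is a minimal sketch of how one benchmark item might be represented. The class name, field names, and the completeness check are all hypothetical, chosen for illustration; the abstract does not specify the benchmark's actual data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoQATriplet:
    """One benchmark item. All field names are hypothetical, not from the paper."""
    video_path: str  # assumed: path to the video of the 3D scene
    question: str    # assumed: the problem statement posed about the scene
    answer: str      # assumed: the reference answer, stored as text


def is_complete(item: VideoQATriplet) -> bool:
    """Return True when all three parts of the triplet are non-empty."""
    return all(part.strip() for part in (item.video_path, item.question, item.answer))
```

A loader could apply a check like this to every item before evaluation, so that a missing video or an empty reference answer is caught early rather than being scored as a model failure.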
