DSI-Bench: A Benchmark for Dynamic Spatial Intelligence
Ziang Zhang, Zehan Wang, Guanghao Zhang, Weilong Dai, Yan Xia, Ziang Yan, Minjie Hong, Zhou Zhao

TL;DR
DSI-Bench introduces a comprehensive benchmark with nearly 1,000 videos and 1,700 questions to evaluate models' understanding of dynamic 3D spatial relationships, revealing current limitations in vision-language and expert models.
Contribution
The paper presents DSI-Bench, a novel benchmark for dynamic spatial reasoning, with a diverse dataset and evaluation framework to systematically assess model capabilities.
Findings
Models often confuse observer and object motion.
Semantic biases affect model reasoning.
Current models struggle with relative dynamic relationships.
Abstract
Reasoning about dynamic spatial relationships is essential, as both observers and objects often move simultaneously. Although vision-language models (VLMs) and visual expertise models excel in 2D tasks and static scenarios, their ability to fully understand dynamic 3D scenarios remains limited. We introduce Dynamic Spatial Intelligence and propose DSI-Bench, a benchmark with nearly 1,000 dynamic videos and over 1,700 manually annotated questions covering nine decoupled motion patterns of observers and objects. Spatially and temporally symmetric designs reduce biases and enable systematic evaluation of models' reasoning about self-motion and object motion. Our evaluation of 14 VLMs and expert models reveals key limitations: models often conflate observer and object motion, exhibit semantic biases, and fail to accurately infer relative relationships in dynamic scenarios. Our DSI-Bench…
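The spatially and temporally symmetric design can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the toy frame representation, the direction vocabulary, the helper names `flip_variants` and `group_correct`, and the `min_correct` threshold. The idea is that mirroring a video swaps left and right, reversing it in time swaps every motion direction, and a model is credited for a group only when it answers consistently across the variants.

```python
from typing import Dict, List, Tuple

Frame = List[List[int]]  # toy frame: rows of pixel values (assumed representation)

# Assumed label vocabulary; the benchmark's actual answer options may differ.
MIRROR_MAP: Dict[str, str] = {"left": "right", "right": "left"}
REVERSE_MAP: Dict[str, str] = {
    "left": "right", "right": "left",
    "forward": "backward", "backward": "forward",
}


def remap(label: str, mapping: Dict[str, str]) -> str:
    """Return the remapped label, leaving labels outside the mapping unchanged."""
    return mapping.get(label, label)


def flip_variants(frames: List[Frame], label: str) -> List[Tuple[List[Frame], str]]:
    """Build the original, mirrored, time-reversed, and mirrored+reversed variants."""
    mirrored = [[row[::-1] for row in frame] for frame in frames]
    return [
        (frames, label),
        (mirrored, remap(label, MIRROR_MAP)),
        (frames[::-1], remap(label, REVERSE_MAP)),
        (mirrored[::-1], remap(remap(label, MIRROR_MAP), REVERSE_MAP)),
    ]


def group_correct(predictions: List[str], labels: List[str], min_correct: int = 3) -> bool:
    """Credit a group only if enough variants are answered correctly.

    The threshold is a hypothetical parameter, not a value taken from the paper.
    """
    hits = sum(p == t for p, t in zip(predictions, labels))
    return hits >= min_correct


if __name__ == "__main__":
    video = [[[1, 2]], [[3, 4]]]  # two one-row frames
    variants = flip_variants(video, "left")
    print([label for _, label in variants])
```

A model that always answers "left" would match only two of the four variants in this example, so a group-level score exposes a directional bias that per-sample accuracy could hide.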
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
DSI-Bench is a good effort to benchmark a model's spatial intelligence, especially in more dynamic scenes. Most existing evaluations focus on static scenes or observers, and DSI-Bench provides a good benchmark for dynamic scenes. The spatial and temporal augmentation to reduce bias is a good effort. Correspondingly, the group evaluation is a good indicator of robustness. It seems that all annotations are reviewed and confirmed by human annotators, including the augmented o
1) The paper does not mention existing benchmarks that evaluate dynamic spatial reasoning. While additional benchmarks in this domain are useful, it would be great to compare and contrast the proposed one with the existing benchmarks. For example: - SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models, COLM 2025 2) The question types are very limited, with only a few fixed templates. Why not expand to more questions? 3) The task’s difficulty is not thoroughly studied. The zero-sh
1. Clear Motivation and Well-Defined Problem Setting The paper is clearly motivated by an under-explored but important challenge: joint reasoning about observer and object motion in dynamic 3D scenes. While prior benchmarks focus mainly on static or single-motion settings, DSI-Bench explicitly targets the coupled dynamics of self-motion and object motion. 2. High-Quality Dataset and Comprehensive Evaluation The dataset construction process is systematic and rigorous: nearly 1,000 videos are sta
1. Limited novelty comparison in task setup: While the paper’s focus on joint observer–object motion reasoning is distinctive within general VLM benchmarks, similar multi-motion setups have been extensively studied in other domains, such as autonomous driving and embodied simulation (e.g., ego-vehicle vs. surrounding vehicle motion reasoning)[1,2]. The paper would benefit from a clearer comparison and positioning against these prior datasets and tasks, highlighting what makes DSI-Bench fundament
The problem is natural, necessary, and clearly not one that current models can handle. The examples are also nicely explanatory of the task.
1. See questions below 2. The model prompts are not presented in the appendix. I would like a better understanding of how the way data/questions are presented to the models affects performance. This includes few-shot settings. I will propose an extreme case. Imagine I gave the model three of four augmentations in the context. Would it still fail to predict the fourth at inference? What about just one example? 3. *Minor* -- typos, "genrated" and Fig 1 has ~no caption, backward quotes, etc
- Dynamic Spatial Intelligence is a valuable research direction. This paper proposes a benchmark for Dynamic Spatial Intelligence with clear categorization and sufficient scale. - It employs Spatio-Temporal Flip Augmentation to reduce bias. - The experiments and analyses are relatively comprehensive.
- Video sources are limited; it is unclear whether the dataset is sufficiently diverse. A thorough analysis of scene diversity is needed. - The question templates are not clearly enumerated or shown; this is important. An analysis of question diversity is needed. - The manual verification process is insufficiently described; clarify procedures and justify data quality guarantees. - Evaluating dedicated spatial understanding models would strengthen the work, though it is not strictly necessary. -
Taxonomy
Topics: Multimodal Machine Learning Applications · Spatial Cognition and Navigation · Constraint Satisfaction and Optimization
