SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

Ellis Brown; Arijit Ray; Ranjay Krishna; Ross Girshick; Rob Fergus; Saining Xie

arXiv:2511.04668 · cs.CV · November 17, 2025

TL;DR

SIMS-V introduces a simulation-based data generation framework for training multimodal language models in spatial video understanding, enabling efficient transfer to real-world tasks with less data and fewer question types.

Contribution

The paper presents SIMS-V, a novel systematic data-generation framework using 3D simulators to improve spatial reasoning in multimodal models, reducing data requirements and enhancing transferability.

Findings

01. A minimal set of three question categories suffices for effective transfer.

02. A 7B-parameter model fine-tuned on 25K simulated examples outperforms larger baselines.

03. The approach maintains general video understanding while improving spatial reasoning on real-world benchmarks.

Abstract

Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V -- a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence,…
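
As a rough illustration of how a simulator's privileged information could be turned into training data, the sketch below builds one metric-measurement question from ground-truth object positions. It is not the authors' pipeline: the `SimObject` record, the `distance_question` helper, the question template, and the assumption that positions are world coordinates in meters are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class SimObject:
    """Hypothetical per-object record exported by a 3D simulator."""
    name: str
    position: Tuple[float, float, float]  # assumed: world coordinates in meters


def distance_question(a: SimObject, b: SimObject) -> Dict[str, str]:
    """Build one metric-measurement QA pair from ground-truth positions."""
    # The simulator knows exact positions, so the answer needs no human labeling.
    dist = math.dist(a.position, b.position)
    return {
        "category": "metric measurement",
        "question": f"How far apart are the {a.name} and the {b.name}, in meters?",
        "answer": f"{dist:.1f}",
    }


# Toy scene: two objects forming a 3-4-5 triangle on the floor plane.
sofa = SimObject("sofa", (0.0, 0.0, 0.0))
lamp = SimObject("lamp", (3.0, 4.0, 0.0))
qa = distance_question(sofa, lamp)
print(qa["question"], "->", qa["answer"])
```

The other two categories named in the abstract (perspective-dependent reasoning and temporal tracking) would presumably also need camera poses and per-frame visibility, which this sketch leaves out.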

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications