Cambrian-S: Towards Spatial Supersensing in Video
Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie

TL;DR
This paper introduces the concept of spatial supersensing in video understanding, proposes new benchmarks to evaluate it, and demonstrates that scaling data alone is insufficient, advocating for predictive sensing to advance true multimodal intelligence.
Contribution
It defines the paradigm of spatial supersensing, introduces VSI-SUPER benchmarks, and presents a predictive sensing approach that significantly improves spatial understanding in videos.
Findings
VSI-SUPER benchmarks reveal current models' limitations in spatial cognition.
Scaling data improves performance but does not fully achieve spatial supersensing.
Predictive sensing with surprise-driven memory outperforms existing baselines.
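The idea of surprise-driven memory can be illustrated with a small sketch. This is not the authors' implementation: it only assumes, following the reviewers' description, that "surprise" is the error between a predicted next-frame latent and the observed one. The names `surprise_filter`, `identity_predictor`, the cosine-distance measure, and the threshold value are all hypothetical choices made for illustration.

```python
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def surprise_filter(latents, predict_next, threshold):
    """Return indices of frames whose latent deviates from the prediction.

    latents      -- list of per-frame latent vectors (hypothetical input)
    predict_next -- callable mapping the previous latent to a predicted one
    threshold    -- surprise level above which a frame is kept in memory
    """
    kept = [0]  # the first frame has no prediction, so it is always kept
    for t in range(1, len(latents)):
        predicted = predict_next(latents[t - 1])
        surprise = cosine_distance(predicted, latents[t])
        if surprise > threshold:
            kept.append(t)
    return kept


def identity_predictor(previous_latent):
    """Toy predictor (an assumption): the scene is expected not to change."""
    return previous_latent


# Toy stream: the scene changes abruptly at frame 2.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
surprising = surprise_filter(frames, identity_predictor, threshold=0.5)
print(surprising)  # [0, 2]
```

In this toy stream only the first frame and the frame where the scene changes are retained, which is the intuition behind letting prediction error decide what enters memory. A real system would use a learned predictor rather than the identity function.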
Abstract
We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These…
Peer Reviews
Decision · ICLR 2026 Poster
1. This work discusses an interesting new paradigm for multimodal video modeling called spatial supersensing, which aims to overcome the limitations of previous methods through predictive modeling driven by an internal world model. 2. The proposed predictive sensing paradigm appears capable of generalizing to various downstream tasks and could be a more advanced form of multimodal intelligence, supported by extensive experiments and analyses.
1. Although the authors claim that the proposed predictive modeling paradigm better helps downstream video understanding tasks, this argument seems to lack sufficient experimental evidence. What advantage does the proposed predictive paradigm have over previous paradigms in downstream generalization? 2. Another concern is that, from my personal understanding, the proposed framework utilizes the "error" between the predicted next frame latent and the ground-truth next
1. Originality and Significance: The paper’s primary strength is its insightful framing. The 4-level taxonomy is a clear and useful way to structure the field's challenges. The diagnostic audit of existing benchmarks (Fig. 2), which shows many are solvable with text captions, is a solid contribution that validates the need for VSI-SUPER. 2. The task design of VSO is well-grounded. 3. The experimental structure is very effective at proving the paper's story.
Overall, this is good work, although with some overclaims. 1. **Limited Task Complexity:** While VSI-SUPER is effective at probing long-horizon memory, the tasks themselves are synthetic and narrow. VSO relies on finding artificially inserted objects, and VSC is a simple counting task (more later on why I call it simple when frontier models fail). This is a reasonable first step, but these tasks do not yet capture the full scope of "spatial supersensing," which should arguably involve more
1. The huge effort to collect and curate the VSI-SUPER benchmark and VSI-590K dataset demonstrates the substantial workload behind this paper, which I think can boost the spatial intelligence community if released in high quality. 2. The so-called predictive sensing, which is modeled by next frame prediction, sounds like a reasonable way to maintain the history memory context for scenarios that proceed smoothly and do not drastically change the scene or its entities. 3. Organizing the next latent
1. The VSO task, which requires MLLMs to observe long spatiotemporal videos and recall the specific locations of an unusual object in the correct order of its appearance, sounds very similar to Needle In A Video Haystack and the common spatial perception task, which requires object appearance order. Can the authors explain and demonstrate the main differences and also the motivations? 2. Regarding the VSC task, does the model need to recognize and count the objects from different instance level
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
