Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Fangfu Liu; Diankun Wu; Jiawei Chi; Yimo Cai; Yi-Hsin Hung; Xumin Yu; Hao Li; Han Hu; Yongming Rao; Yueqi Duan

arXiv:2603.12255·cs.CV·March 13, 2026

TL;DR

Spatial-TTT introduces a novel streaming visual spatial intelligence model that adapts parameters at test time, effectively capturing and organizing long-term spatial information from video streams for improved spatial understanding.

Contribution

The paper proposes a hybrid architecture with test-time training and a spatial-predictive mechanism, along with a new dataset, to enhance long-horizon spatial reasoning in videos.

Findings

01. Achieves state-of-the-art results on video spatial benchmarks.
02. Effectively captures geometric and temporal continuity across frames.
03. Improves long-term spatial understanding in streaming video data.

Abstract

Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to maintain and update spatial evidence in a streaming fashion from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows but how spatial information is selected, organized, and retained over time. In this paper, we propose Spatial-TTT, a step towards streaming visual-based spatial intelligence with test-time training (TTT), which adapts a subset of parameters (fast weights) to capture and organize spatial evidence over long-horizon scene videos. Specifically, we design a hybrid architecture and adopt large-chunk updates in parallel with sliding-window attention for efficient spatial video processing. To further promote spatial awareness, we introduce a spatial-predictive mechanism applied to TTT layers with 3D…
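
To make the idea of fast weights updated in large chunks more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation. It assumes a single linear fast-weight matrix `W`, a squared-error objective that maps each key to its value, one gradient step per chunk, and an apply-then-update order; all function names, the learning rate `lr`, and the objective itself are hypothetical choices made for illustration. The sliding-window attention branch and the spatial-predictive mechanism mentioned in the abstract are left out.

```python
# Hypothetical sketch of chunk-wise test-time training with fast weights.
# Assumptions (not from the paper): linear fast weights, squared-error loss,
# one gradient step per chunk, outputs read before each chunk's update.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def chunk_loss(W, keys, values):
    """Mean over the chunk of 0.5 * ||W k - v||^2."""
    total = 0.0
    for k, v in zip(keys, values):
        pred = matvec(W, k)
        total += 0.5 * sum((p - t) ** 2 for p, t in zip(pred, v))
    return total / len(keys)

def chunk_update(W, keys, values, lr):
    """Take one gradient step on the chunk loss and return the new W."""
    d_out, d_in = len(W), len(W[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    for k, v in zip(keys, values):
        pred = matvec(W, k)
        err = [p - t for p, t in zip(pred, v)]
        for i in range(d_out):
            for j in range(d_in):
                # Gradient of 0.5 * ||W k - v||^2 w.r.t. W is err * k^T.
                grad[i][j] += err[i] * k[j] / len(keys)
    return [[W[i][j] - lr * grad[i][j] for j in range(d_in)]
            for i in range(d_out)]

def stream(keys, values, queries, chunk_size, lr=0.5):
    """Process a token stream chunk by chunk, adapting W as it goes."""
    d = len(keys[0])
    W = [[0.0] * d for _ in range(d)]  # fast weights start empty
    outputs = []
    for start in range(0, len(keys), chunk_size):
        end = start + chunk_size
        # Read with the fast weights accumulated from earlier chunks.
        outputs.extend(matvec(W, q) for q in queries[start:end])
        # Then fold the current chunk into the fast weights.
        W = chunk_update(W, keys[start:end], values[start:end], lr)
    return W, outputs

# Toy stream: every value is twice its key, so W should drift towards 2 * I.
keys = [[1.0, 0.0], [0.0, 1.0]] * 4
values = [[2.0 * x for x in k] for k in keys]
W, outputs = stream(keys, values, keys, chunk_size=2)
```

In this toy setup the memory of the stream lives entirely in `W`, whose size is fixed regardless of how many chunks have been seen, which is the property that makes the approach attractive for unbounded video. Updating once per chunk rather than once per token trades some adaptation granularity for fewer, larger, more parallelizable updates.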

Taxonomy

Topics: Multimodal Machine Learning Applications · Face recognition and analysis · Human Pose and Action Recognition