SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes
Zhicheng Qiu, Jiarui Meng, Tong-an Luo, Yican Huang, Xuan Feng, Xuanfu Li, ZHan Xu

TL;DR
SLARM is a real-time, unified model for dynamic scene reconstruction, semantic understanding, and streaming inference that leverages higher-order motion modeling and language-aligned features for improved accuracy and robustness.
Contribution
Introduces SLARM, a feed-forward framework that combines dynamic reconstruction, semantic querying, and streaming inference, and that is trained without flow supervision.
Findings
Improves motion accuracy by 21%.
Improves reconstruction PSNR by 1.6 dB.
Increases segmentation mIoU by 20%.
Abstract
We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling, trained solely on differentiable renderings without any flow supervision. In addition, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further enhances the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy…
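The window-based causal attention mentioned in the abstract can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function name `sliding_window_causal_mask`, the frame-level granularity, and the window size of 3 are assumptions chosen for illustration, since the abstract does not state how SLARM defines its windows.

```python
def sliding_window_causal_mask(num_frames: int, window: int) -> list[list[bool]]:
    """Build a frame-level attention mask (hypothetical helper, not from the paper).

    mask[i][j] is True when frame i may attend to frame j. Frame i sees only
    itself and the (window - 1) frames before it, never a future frame.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [i - window < j <= i for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Assumed toy setting: 5 frames, each attending to at most 3 frames.
mask = sliding_window_causal_mask(num_frames=5, window=3)

# The largest number of frames any single frame attends to. It is bounded by
# the window size, not by the sequence length.
max_context = max(sum(row) for row in mask)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because each frame attends to at most `window` frames, the context that must be kept during streaming stays bounded as the sequence grows, which is consistent with the abstract's claim of streaming inference without accumulating memory cost.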
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
