SLARM: Streaming and Language-Aligned Reconstruction Model for Dynamic Scenes

Zhicheng Qiu; Jiarui Meng; Tong-an Luo; Yican Huang; Xuan Feng; Xuanfu Li; ZHan Xu

arXiv:2603.22893 · cs.CV · March 27, 2026

Open Access

TL;DR

SLARM is a real-time, unified model for dynamic scene reconstruction, semantic understanding, and streaming inference that leverages higher-order motion modeling and language-aligned features for improved accuracy and robustness.

Contribution

The paper introduces SLARM, a feed-forward framework that combines dynamic reconstruction, semantic querying, and streaming inference in a single model, trained without flow supervision.

Findings

01. Achieves 21% better motion accuracy.

02. Improves reconstruction PSNR by 1.6 dB.

03. Increases segmentation mIoU by 20%.

Abstract

We propose SLARM, a feed-forward model that unifies dynamic scene reconstruction, semantic understanding, and real-time streaming inference. SLARM captures complex, non-uniform motion through higher-order motion modeling and is trained solely on differentiable renderings, without any flow supervision. In addition, SLARM distills semantic features from LSeg to obtain language-aligned representations. This design enables semantic querying via natural language, and the tight coupling between semantics and geometry further improves the accuracy and robustness of dynamic reconstruction. Moreover, SLARM processes image sequences using window-based causal attention, achieving stable, low-latency streaming inference without accumulating memory cost. Within this unified framework, SLARM achieves state-of-the-art results in dynamic estimation, rendering quality, and scene parsing, improving motion accuracy…
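
The window-based causal attention mentioned in the abstract can be illustrated with a small mask-building sketch. This is not the authors' implementation: the function name, the per-frame granularity, and the window size of 3 are assumptions chosen for illustration, since the abstract does not give these details. The sketch only shows the general idea that each frame attends to itself and a fixed number of earlier frames, never to later ones.

```python
# Illustrative sketch only: names, per-frame granularity, and window size
# are assumptions, not details taken from the SLARM paper.

def sliding_window_causal_mask(num_frames, window):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Frame i sees itself and at most `window - 1` earlier frames,
    and never a later frame.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [i - window < j <= i for j in range(num_frames)]
        for i in range(num_frames)
    ]


mask = sliding_window_causal_mask(num_frames=5, window=3)
for row in mask:
    # "x" marks an allowed (query frame, key frame) pair.
    print("".join("x" if allowed else "." for allowed in row))
```

Each row of such a mask has at most `window` allowed entries, so the number of past frames that must be kept around stays fixed as the sequence grows. That is one plausible reading of how a windowed causal design avoids accumulating memory cost during streaming.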

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis