TL;DR
STream3R introduces a scalable, Transformer-based framework for real-time 3D reconstruction from image sequences, outperforming prior methods especially in dynamic scenes and enabling large-scale pretraining.
Contribution
It reformulates 3D pointmap prediction as a causal Transformer problem, enabling efficient streaming processing and better generalization in dynamic environments.
Findings
Outperforms prior methods on static and dynamic scene benchmarks.
Efficiently processes image sequences using causal attention.
Compatible with large-scale pretraining and fine-tuning.
Abstract
We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure,…
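The streaming idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the single-head layer, the `StreamingAttention` class, the random projection weights, and the choice to let tokens attend freely within their own frame are all assumptions made for illustration. The sketch only shows that processing frames one at a time with a key/value cache gives the same result as running frame-level causal attention over the whole sequence at once.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) scores become exact zeros.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v, mask=None):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v


def frame_causal_mask(num_frames, tokens_per_frame):
    # True where a query token may attend to a key token: same frame or any
    # earlier frame (assumption: attention is unrestricted inside a frame).
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_id[:, None] >= frame_id[None, :]


class StreamingAttention:
    """Hypothetical single-head layer with a growing key/value cache."""

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.wq = rng.standard_normal((dim, dim)) * scale
        self.wk = rng.standard_normal((dim, dim)) * scale
        self.wv = rng.standard_normal((dim, dim)) * scale
        self.k_cache = np.zeros((0, dim))
        self.v_cache = np.zeros((0, dim))

    def step(self, frame_tokens):
        # Streaming path: append this frame's keys/values to the cache, then
        # let its queries attend to everything seen so far.
        self.k_cache = np.concatenate([self.k_cache, frame_tokens @ self.wk])
        self.v_cache = np.concatenate([self.v_cache, frame_tokens @ self.wv])
        return attention(frame_tokens @ self.wq, self.k_cache, self.v_cache)

    def full_sequence(self, tokens, num_frames, tokens_per_frame):
        # Offline path: all frames at once under a frame-level causal mask.
        mask = frame_causal_mask(num_frames, tokens_per_frame)
        return attention(tokens @ self.wq, tokens @ self.wk,
                         tokens @ self.wv, mask)
```

With this layout, each new frame costs one attention call against the cached keys and values rather than a recomputation over all earlier frames, which is the property that makes causal attention attractive for streaming input.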
Peer Reviews
Decision·ICLR 2026 Poster
### Originality
* To the best of my knowledge, this is among the first works to extend feed-forward 3D reconstruction to sequential processing via a causal transformer, rather than relying on pose-graph/global alignment.
* While it could use more elaboration, the [reg] token is an interesting mechanism to make the model explicitly aware of the anchor frame. In addition, omitting view embeddings is a novel choice that encourages order-agnostic generalization to different input image orders.
###
- Contribution may be limited: The gains appear to stem primarily from a causal transformer framework rather than components specific to 3D reconstruction. The paper should clarify what is fundamentally new beyond causal masking and standard transformer design, and what is uniquely tailored to 3D reconstruction.
- Anchor design needs more elaboration: The proposed [reg] token is interesting, but there is no controlled comparison to simpler anchors such as a global CLS token or relative view pos
1. The core architectural proposal, a decoder-only causal transformer with a KV cache, is a novel and highly effective paradigm for streaming 3D reconstruction. It offers a more scalable and powerful alternative to prior methods based on RNN-like structures (like CUT3R) or expensive global optimization (like VGGT or DUSt3R-GA).
2. The method demonstrates SOTA-competitive performance across a comprehensive suite of benchmarks, including video depth estimation, 3D reconstruction, and camera
Please refer to the weaknesses section.
The target problem of supporting feed-forward, streaming 3D reconstruction is timely, and the proposed framework makes sense. Assuming the review does not need to consider concurrent works, I think the proposed framework is comparable to existing concurrent works in terms of both performance and novelty.
I only have some minor concerns:
- As shown in Table 7, StreamVGGT, a concurrent work with a similar key idea, is included in the comparison. From the results, the proposed method appears to outperform StreamVGGT, but no explanation is given. I understand that this is a concurrent work. However, an explanation of the differences between StreamVGGT and the proposed method, and of why the proposed method performs better, would aid understanding.
- In the contribution list, the
