STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer

Yushi Lan; Yihang Luo; Fangzhou Hong; Shangchen Zhou; Honghua Chen; Zhaoyang Lyu; Shuai Yang; Bo Dai; Chen Change Loy; Xingang Pan

arXiv:2508.10893·cs.CV·August 15, 2025

TL;DR

STream3R introduces a scalable, Transformer-based framework for real-time 3D reconstruction from image sequences, outperforming prior methods especially in dynamic scenes and enabling large-scale pretraining.

Contribution

It reformulates 3D pointmap prediction as a causal Transformer problem, enabling efficient streaming processing and better generalization in dynamic environments.
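To make the causal reformulation concrete, here is a minimal sketch of a frame-level ("block-causal") attention mask. It assumes, purely for illustration, that every frame contributes the same number of tokens, that tokens attend freely within their own frame, and that they attend to all earlier frames but never to later ones. The helper name `block_causal_mask` is hypothetical and is not taken from the paper or its code.

```python
import numpy as np

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean mask; entry [i, j] is True when query token i may attend to key token j.

    Assumption for illustration: attention is bidirectional inside a frame
    and causal across frames.
    """
    # Frame index of every token in the flattened sequence, e.g. [0, 0, 1, 1, 2, 2].
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    # A query in frame a may see a key in frame b only if b <= a.
    return frame_id[:, None] >= frame_id[None, :]

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```

Because no row ever looks at a later frame, the output for a frame does not change when new frames arrive, which is the property that makes streaming possible.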

Findings

01

Outperforms prior methods on static and dynamic scene benchmarks.

02

Efficiently processes image sequences using causal attention.

03

Compatible with large-scale pretraining and fine-tuning.

Abstract

We present STream3R, a novel approach to 3D reconstruction that reformulates pointmap prediction as a decoder-only Transformer problem. Existing state-of-the-art methods for multi-view reconstruction either depend on expensive global optimization or rely on simplistic memory mechanisms that scale poorly with sequence length. In contrast, STream3R introduces a streaming framework that processes image sequences efficiently using causal attention, inspired by advances in modern language modeling. By learning geometric priors from large-scale 3D datasets, STream3R generalizes well to diverse and challenging scenarios, including dynamic scenes where traditional methods often fail. Extensive experiments show that our method consistently outperforms prior work across both static and dynamic scene benchmarks. Moreover, STream3R is inherently compatible with LLM-style training infrastructure,…
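The efficiency claim for causal attention can be illustrated with a key/value cache, as used in language-model decoding. The sketch below is a single-head, single-layer toy in NumPy, not the authors' implementation. The class `FrameKVCache`, the shapes, and the random inputs are all assumptions made for illustration. It processes frames one at a time, reusing cached keys and values, and then checks that the result matches full attention over the whole sequence under a frame-level causal mask.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attention is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

class FrameKVCache:
    """Hypothetical cache holding keys/values of all frames seen so far."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Append the new frame, then attend over the past frames plus this one.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, np.concatenate(self.keys), np.concatenate(self.values))

rng = np.random.default_rng(0)
num_frames, tokens_per_frame, dim = 3, 4, 8
q, k, v = (rng.normal(size=(num_frames, tokens_per_frame, dim)) for _ in range(3))

# Streaming: one frame at a time, earlier frames are never recomputed.
cache = FrameKVCache()
streamed = np.concatenate([cache.step(q[t], k[t], v[t]) for t in range(num_frames)])

# Reference: all frames at once with a frame-level causal mask.
frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
mask = frame_id[:, None] >= frame_id[None, :]
full = attend(q.reshape(-1, dim), k.reshape(-1, dim), v.reshape(-1, dim), mask)

matches = bool(np.allclose(streamed, full))
print(matches)
```

In this toy, each new frame costs attention over the cached tokens only, instead of re-running attention over the entire sequence, while producing the same outputs as the masked batch computation.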

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 5

Strengths

### Originality

* To the best of my knowledge, this is among the first works to extend feed-forward 3D reconstruction to sequential processing via a causal transformer, rather than relying on pose-graph/global alignment.
* While it could use more elaboration, the [reg] token is an interesting mechanism to make the model explicitly aware of the anchor frame. In addition, omitting view embeddings is a novel choice that encourages order-agnostic generalization to different input image orders.

###

Weaknesses

- Contribution may be limited: The gains appear to stem primarily from a causal transformer framework rather than components specific to 3D reconstruction. The paper should clarify what is fundamentally new beyond causal masking and standard transformer design, and what is uniquely tailored to 3D reconstruction.
- Anchor design needs more elaboration: The proposed [reg] token is interesting, but there is no controlled comparison to simpler anchors such as a global CLS token or relative view pos

Reviewer 02 · Rating 6 · Confidence 5

Strengths

1. The core architectural proposal to use a decoder-only causal transformer with a KV cache is a novel and highly effective paradigm for streaming 3D reconstruction. It offers a more scalable and powerful alternative to prior methods based on RNN-like structures (like CUT3R) or expensive global optimization (like VGGT or DUSt3R-GA).
2. The method demonstrates performance competitive with the state of the art across a comprehensive suite of benchmarks, including video depth estimation, 3D reconstruction, and camera

Weaknesses

please refer to the weakness part.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

The target problem of supporting feed-forward, streaming 3D reconstruction is timely, and the proposed framework makes sense. Assuming the review does not need to consider concurrent works, I think the proposed framework is comparable to existing concurrent works in terms of both performance and novelty.

Weaknesses

I only have some minor concerns:

- As shown in Table 7, StreamVGGT, a concurrent work with a similar key idea, is included in the comparison. From the results, the proposed method seems to outperform StreamVGGT, but no explanation is given. I understand this is a concurrent work. However, an explanation of the differences between StreamVGGT and the proposed method, and of why the proposed method's results are better, would be helpful for understanding.
- In the contribution list, the

Code & Models

Models

🤗
yslan/STream3R
model · 786 dl · ♡ 5