StreamSplat: Towards Online Dynamic 3D Reconstruction from Uncalibrated Video Streams
Zike Wu, Qi Yan, Xuanyu Yi, Lele Wang, Renjie Liao

TL;DR
StreamSplat introduces a fast, online framework for real-time 3D scene reconstruction from uncalibrated video streams, outperforming traditional optimization-based methods in speed and quality.
Contribution
It presents a novel feed-forward approach with probabilistic sampling, bidirectional deformation, and adaptive Gaussian fusion for online dynamic 3D reconstruction.
Findings
Achieves state-of-the-art quality on standard benchmarks.
Supports arbitrarily long video streams with 1200x speedup.
Outperforms optimization-based methods in speed and accuracy.
Abstract
Real-time reconstruction of dynamic 3D scenes from uncalibrated video streams demands robust online methods that recover scene dynamics from sparse observations under strict latency and memory constraints. Yet most dynamic reconstruction methods rely on hours of per-scene optimization under full-sequence access, limiting practical deployment. In this work, we introduce StreamSplat, a fully feed-forward framework that instantly transforms uncalibrated video streams of arbitrary length into dynamic 3D Gaussian Splatting (3DGS) representations in an online manner. It is achieved via three key technical innovations: 1) a probabilistic sampling mechanism that robustly predicts 3D Gaussians from uncalibrated inputs; 2) a bidirectional deformation field that yields reliable associations across frames and mitigates long-term error accumulation; 3) an adaptive Gaussian fusion operation that…
Peer Reviews
Decision: ICLR 2026 Poster
Strength: 1. The paper is well-written and easy to follow. 2. StreamSplat is feed-forward and increases the speed of reconstruction. 3. In Figure 4, StreamSplat shows persistent Gaussians across frames, which demonstrates its potential for long-term modeling.
Major Weakness: 1. Lack of Rigorous Evaluation Protocols: The evaluation does not follow established dynamic reconstruction benchmarks such as DyCheck or NVIDIA Dynamic Scene Dataset. The chosen datasets (DAVIS and YouTube-VOS) are more typical for video segmentation or interpolation, not 4D reconstruction. 2. Limited Training Dataset: The paper uses a mix of static (CO3Dv2, RealEstate10K) and limited dynamic (DAVIS, YouTube-VOS) datasets for training. However, DAVIS contains only a few short
* The paper tackles an underexplored yet practically important problem: real-time dynamic 3D reconstruction from uncalibrated video streams, which existing 3DGS and NeRF-based methods generally overlook due to their offline and per-scene optimization nature. * Unlike prior optimization-based dynamic 3DGS methods, StreamSplat introduces a fully feed-forward pipeline that supports online inference without requiring camera calibration or pre-computed poses, making it highly suitable for real-world
* It would be helpful if the authors could clarify whether their framework is capable of predicting or estimating camera poses, given that StreamSplat operates under uncalibrated input conditions. If not, discussing potential extensions in this direction would strengthen the paper's completeness. * The paper would benefit from additional discussion or experiments on highly dynamic scenes with significant topological changes (e.g., frequent object entries and exits from the field of view). It re
- **Impressive empirical results:** The method achieves substantial gains over prior 3DGS and NeRF-based approaches, setting a new benchmark for *online dynamic reconstruction from uncalibrated video streams*. - **Reasonable and clear pipeline:** The proposed two-stage static/dynamic training scheme is well-motivated and reproducible. The combination of a strong image encoder and a dynamic decoder makes architectural sense. - **Excellent efficiency:** StreamSplat attains orders-of-magnitude
Major Weaknesses **W1.** The formulation in *line 157* appears problematic: $(u,v)$ represents pixel coordinates while the offset $o_i$ is in unit space. Their direct addition may be incorrect if the coordinate system is rectilinear. A clarification of this coordinate transformation is needed. **W2.** Algorithm 2’s *aggregation and fusion* step is ambiguous. The operation `UPDATE` in line 228 is not clearly defined—does it simply replace $\tilde{\mathcal{G}}$ with $\mathcal{G}_{k-1}^+$, or
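The unit mismatch raised in W1 can be illustrated with a minimal sketch. It is not the paper's formulation: it assumes that "unit space" means the normalized image square [0, 1] x [0, 1] with pixel centers at half-integer positions, and the helper names (`pixel_to_unit`, `unit_to_pixel`, `apply_offset`) are hypothetical. Under that assumption, the pixel coordinate is converted to unit space before the offset is added, so both terms share the same units.

```python
def pixel_to_unit(u, v, width, height):
    """Map a pixel index (u, v) to unit-square coordinates.

    Assumption: pixel centers sit at (u + 0.5, v + 0.5) and the image
    spans [0, 1] x [0, 1]. The paper's convention may differ.
    """
    return (u + 0.5) / width, (v + 0.5) / height


def unit_to_pixel(x, y, width, height):
    """Inverse of pixel_to_unit: unit-square coordinates back to pixels."""
    return x * width - 0.5, y * height - 0.5


def apply_offset(u, v, offset, width, height):
    """Add a unit-space offset to a pixel location, in consistent units."""
    x, y = pixel_to_unit(u, v, width, height)
    return x + offset[0], y + offset[1]


# Hypothetical 4x2 image: a unit-space offset of (0.25, 0.5) corresponds
# to a shift of one pixel along each axis.
shifted = apply_offset(0, 0, (0.25, 0.5), width=4, height=2)
shifted_px = unit_to_pixel(*shifted, width=4, height=2)

# Adding the same offset directly to the pixel index would instead move
# the point by only a fraction of a pixel, which is the mismatch W1 notes.
naive_px = (0 + 0.25, 0 + 0.5)
```

In this toy setting the consistent conversion and the naive addition disagree, which is why the reviewer asks for the coordinate transformation to be stated explicitly.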
Taxonomy
Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Human Pose and Action Recognition
