UniSplat: Unified Spatio-Temporal Fusion via 3D Latent Scaffolds for Dynamic Driving Scene Reconstruction
Chen Shi, Shaoshuai Shi, Xiaoyang Lyu, Chunyang Liu, Kehua Sheng, Bo Zhang, Li Jiang

TL;DR
UniSplat introduces a unified 3D latent scaffold framework for dynamic scene reconstruction in autonomous driving, effectively handling sparse views and scene dynamics to achieve state-of-the-art results.
Contribution
The paper proposes a novel 3D latent scaffold and fusion mechanism that improves dynamic scene reconstruction and novel view synthesis in complex driving environments.
Findings
Achieves state-of-the-art novel view synthesis performance.
Provides robust reconstructions even outside original camera coverage.
Enables streaming scene completion with persistent static scene memory.
Abstract
Feed-forward 3D reconstruction for autonomous driving has advanced rapidly, yet existing methods struggle with the joint challenges of sparse, non-overlapping camera views and complex scene dynamics. We present UniSplat, a general feed-forward framework that learns robust dynamic scene reconstruction through unified latent spatio-temporal fusion. UniSplat constructs a 3D latent scaffold, a structured representation that captures geometric and semantic scene context by leveraging pretrained foundation models. To effectively integrate information across spatial views and temporal frames, we introduce an efficient fusion mechanism that operates directly within the 3D scaffold, enabling consistent spatio-temporal alignment. To ensure complete and detailed reconstructions, we design a dual-branch decoder that generates dynamic-aware Gaussians from the fused scaffold by combining…
Peer Reviews
Decision·ICLR 2026 Poster
* The paper is well written and clearly structured, making the technical ideas easy to follow despite the method’s complexity.
* The proposed scaffold-based fusion mechanism is both intuitive and practical, effectively addressing the challenge of modelling dynamic scenes while supporting progressive memory updates.
* Experimental evaluation is thorough, including comparisons to multiple strong baselines on two large-scale driving datasets, with consistent quantitative and qualitative improvement
* It is not entirely clear how the scaffold-based fusion mechanism handles dynamic content. For example, when a vehicle moves across frames, how are its features aligned or updated consistently during fusion?
* The paper shows novel view rendering by rotating the camera, but it remains unclear whether dynamic actors can be moved or manipulated. Supporting such editing would enable full camera simulation.
* Maintaining and updating a dense 3D scaffold could be computationally expensive for long s
- An engineering-first pipeline. This paper proposes a three-stage pipeline with a good motivation for 3D scaffold–space fusion, using sparse 3D UNets for spatial aggregation and pose-conditioned temporal accumulation. Overall it's easy to follow.
- Dynamic-handling. The proposed method can handle dynamic scenes.
- Experimental coverage. Evaluations on Waymo and nuScenes with both quantitative tables and qualitative figures; ablations cover feature composition, spatial vs temporal fusion, and deco
- **A System with Limited Conceptual Novelty**: The primary weakness is the paper's contribution, which appears to be more of a strong engineering effort than a conceptual breakthrough.
- The “3D latent scaffold” (spatial-fusion) is essentially a sparse voxel grid with fused geometry+semantic features, a very standard way of fusing 3D features in 3D grids, used in many domains (like SLAM, or 3D understanding) and in recent generalizable 3DGS/voxel/triplane works.
- Warping and fusing featur
Proposes a unified spatio-temporal fusion paradigm in a single latent 3D scaffold, which is a conceptually cleaner and more efficient design than previous separate spatial and temporal modules. The dual-branch Gaussian decoder effectively balances fine-grained details (point) and global completeness (voxel). The dynamic filtering + memory streaming mechanism elegantly addresses out-of-FOV reconstruction and ghosting issues caused by dynamic objects. Strong experimental results and ablations demo
While the framework is conceptually unified, many of its components (voxel scaffold, temporal warping, Gaussian splatting) are adapted from prior work. The dynamic filtering relies on threshold-based heuristics; it is not clear how robust this is under complex motion or sensor noise. The paper lacks a deeper comparison with recent diffusion-based or token-based reconstruction frameworks, which could strengthen the positioning of this method. Some design choices (e.g., the specific form of the du
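Several of the reviews above turn on how pose-conditioned temporal accumulation and threshold-based dynamic filtering interact. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: the function names, the per-voxel dynamic score `mem_dyn`, the threshold `dyn_thresh`, and the use of simple mean pooling are all assumptions made for illustration (the reviews indicate the paper itself uses sparse 3D UNets for aggregation).

```python
import numpy as np

def voxelize(points, feats, voxel_size):
    """Mean-pool per-point features into sparse voxels (hypothetical helper)."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    # Unique integer voxel coordinates; `inverse` maps each point to its voxel.
    keys, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(keys), feats.shape[1]))
    np.add.at(sums, inverse, feats)  # unbuffered scatter-add per voxel
    counts = np.bincount(inverse, minlength=len(keys))[:, None]
    return keys, sums / counts

def fuse_with_memory(mem_pts, mem_feats, mem_dyn, cur_pts, cur_feats,
                     T_prev_to_cur, voxel_size=0.5, dyn_thresh=0.5):
    """Warp static memory into the current frame and fuse it with new features.

    mem_dyn is an assumed per-element dynamic score in [0, 1]; elements at or
    above dyn_thresh are treated as dynamic and dropped from the memory.
    T_prev_to_cur is a 4x4 rigid transform from the previous ego frame to the
    current one.
    """
    static = mem_dyn < dyn_thresh
    # Homogeneous coordinates so the 4x4 pose applies in one matmul.
    pts_h = np.concatenate([mem_pts[static], np.ones((static.sum(), 1))], axis=1)
    warped = (T_prev_to_cur @ pts_h.T).T[:, :3]
    all_pts = np.concatenate([warped, cur_pts], axis=0)
    all_feats = np.concatenate([mem_feats[static], cur_feats], axis=0)
    return voxelize(all_pts, all_feats, voxel_size)
```

In this sketch, a hard threshold on the dynamic score decides what persists in memory, which is exactly the kind of heuristic whose robustness under complex motion or sensor noise the last review questions.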
Taxonomy
Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis
