TL;DR
Flash-Mono is a feed-forward Gaussian Splatting SLAM system for monocular input. It predicts Gaussian attributes directly from multi-frame context instead of fitting them through time-consuming optimization, which improves both speed (a reported 10x) and geometric accuracy.
Contribution
The paper proposes a novel feed-forward architecture with a recurrent frontend for monocular SLAM, enabling 10x faster processing and improved geometric fidelity compared to prior optimization-based methods.
Findings
Achieves 10x speedup over traditional GS-SLAM methods.
Maintains high-quality rendering and mapping accuracy.
Enables efficient loop closure using hidden states as submap descriptors.
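The last finding, hidden states doubling as submap descriptors, can be illustrated with a small retrieval sketch. This is not the paper's implementation: it assumes each submap's hidden state is a list of token vectors, that mean pooling plus L2 normalization gives a usable descriptor, and that cosine similarity above a threshold marks a loop candidate. The names `pool_hidden_state`, `find_loop_candidate`, `min_gap`, and `threshold` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def pool_hidden_state(tokens: Sequence[Sequence[float]]) -> Vector:
    """Mean-pool hidden-state tokens into one unit-norm descriptor.

    Assumption: mean pooling is a stand-in; the paper may derive its
    descriptor from the hidden state differently.
    """
    dim = len(tokens[0])
    mean = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0  # avoid division by zero
    return [v / norm for v in mean]


def find_loop_candidate(
    query: Vector,
    database: Sequence[Vector],
    min_gap: int = 2,
    threshold: float = 0.9,
) -> Optional[Tuple[int, float]]:
    """Return (index, similarity) of the best earlier submap, or None.

    The most recent `min_gap` entries are skipped so that temporally
    adjacent submaps are not reported as loops (hypothetical heuristic).
    """
    best: Optional[Tuple[int, float]] = None
    last_allowed = max(len(database) - min_gap, 0)
    for idx in range(last_allowed):
        # Descriptors are unit-norm, so the dot product is cosine similarity.
        sim = sum(q * d for q, d in zip(query, database[idx]))
        if sim >= threshold and (best is None or sim > best[1]):
            best = (idx, sim)
    return best


if __name__ == "__main__":
    db = [
        pool_hidden_state([[1.0, 0.0], [1.0, 0.2]]),
        pool_hidden_state([[0.0, 1.0], [0.1, 1.0]]),
        pool_hidden_state([[-1.0, 0.0], [-1.0, 0.1]]),
    ]
    current = pool_hidden_state([[1.0, 0.1], [1.0, 0.1]])
    print(find_loop_candidate(current, db, min_gap=1))
```

Because the descriptor reuses a state the frontend already computes, a lookup like this would add only a dot product per stored submap; a real system would still need to verify the candidate geometrically before closing the loop.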
Abstract
Monocular 3D Gaussian Splatting SLAM suffers from critical limitations in time efficiency, geometric accuracy, and multi-view consistency. These issues stem from the time-consuming optimization and the lack of inter-frame scale consistency from single-frame geometry priors. We contend that a feed-forward paradigm, leveraging multi-frame context to predict Gaussian attributes directly, is crucial for addressing these challenges. We present Flash-Mono, a system composed of three core modules: a feed-forward prediction frontend, a 2D Gaussian Splatting mapping backend, and an efficient hidden-state-based loop closure module. We trained a recurrent feed-forward frontend model that progressively aggregates multi-frame visual features into a hidden state via cross attention and jointly predicts camera poses and per-pixel Gaussian properties. By directly…
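The recurrent aggregation described in the abstract, where a hidden state absorbs each new frame's features via cross attention, can be sketched in a few lines. This is a simplified illustration rather than the Flash-Mono model: it uses a single attention head, omits the learned query/key/value projections and the pose and Gaussian prediction heads, and applies a plain residual update. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_update(
    hidden: Sequence[Vector], frame_tokens: Sequence[Vector]
) -> List[Vector]:
    """One recurrent step: hidden tokens (queries) attend to a frame's tokens.

    Assumption: keys and values are the raw frame tokens; a trained model
    would apply learned projections first.
    """
    dim = len(hidden[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for h in hidden:
        scores = [scale * sum(a * b for a, b in zip(h, f)) for f in frame_tokens]
        weights = softmax(scores)
        attended = [
            sum(w * f[i] for w, f in zip(weights, frame_tokens))
            for i in range(dim)
        ]
        # Residual update keeps earlier context while adding the new frame.
        updated.append([hv + av for hv, av in zip(h, attended)])
    return updated


def aggregate_frames(
    hidden: Sequence[Vector], frames: Sequence[Sequence[Vector]]
) -> List[Vector]:
    """Fold a sequence of frames into the hidden state, one frame at a time."""
    state = [list(h) for h in hidden]
    for frame_tokens in frames:
        state = cross_attention_update(state, frame_tokens)
    return state


if __name__ == "__main__":
    initial = [[0.0, 0.0]]
    frames = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0]]]
    print(aggregate_frames(initial, frames))
```

The point of the sketch is the data flow: the hidden state has a fixed size regardless of how many frames have been seen, so per-frame cost stays constant while context accumulates.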
