VINGS-Mono: Visual-Inertial Gaussian Splatting Monocular SLAM in Large Scenes
Ke Wu, Zicheng Zhang, Muer Tie, Ziqing Ai, Zhongxue Gan, Wenchao Ding

TL;DR
VINGS-Mono is a monocular SLAM framework that uses Gaussian Splatting for large-scale outdoor scenes, achieving real-time, high-quality mapping and localization through a novel loop closure module and dynamic object handling.
Contribution
It is the first monocular Gaussian SLAM system capable of large outdoor scene mapping and real-time operation on a smartphone, integrating novel loop closure and dynamic object removal.
Findings
Achieves localization comparable to visual-inertial odometry.
Outperforms recent Gaussian Splatting/NeRF SLAM methods in mapping quality.
Operates in real-time on mobile devices.
Abstract
VINGS-Mono is a monocular (inertial) Gaussian Splatting (GS) SLAM framework designed for large scenes. The framework comprises four main components: VIO Front End, 2D Gaussian Map, NVS Loop Closure, and Dynamic Eraser. In the VIO Front End, RGB frames are processed through dense bundle adjustment and uncertainty estimation to extract scene geometry and poses. Based on this output, the mapping module incrementally constructs and maintains a 2D Gaussian map. Key components of the 2D Gaussian Map include a Sample-based Rasterizer, Score Manager, and Pose Refinement, which collectively improve mapping speed and localization accuracy. This enables the SLAM system to handle large-scale urban environments with up to 50 million Gaussian ellipsoids. To ensure global consistency in large-scale scenes, we design a Loop Closure module, which innovatively leverages the Novel View Synthesis (NVS)…
Taxonomy
Topics: Robotics and Sensor-Based Localization · Gaze Tracking and Assistive Technology · Advanced Image and Video Retrieval Techniques
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
