M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM
Kerui Ren, Guanghao Li, Changjian Jiang, Yingxiang Xu, Tao Lu, Linning Xu, Junting Dong, Jiangmiao Pang, Mulin Yu, Bo Dai

TL;DR
M^3 adds a dedicated dense matching head to a multi-view foundation model and integrates it into a monocular Gaussian Splatting SLAM system, improving pose accuracy and robustness in dynamic environments and reporting state-of-the-art results.
Contribution
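M^3 augments a multi-view foundation model with a matching head that produces fine-grained dense correspondences, then couples it with a monocular Gaussian Splatting SLAM pipeline. Dynamic area suppression and cross-inference intrinsic alignment further stabilize tracking, improving pose accuracy and scene reconstruction quality.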
Findings
Reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0
Outperforms ARTDECO by 2.11 dB in PSNR on ScanNet++
Achieves state-of-the-art accuracy across diverse benchmarks
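For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how ATE RMSE, PSNR, and a relative reduction are commonly computed. It is not the paper's evaluation code: the function names are hypothetical, the trajectories are assumed to be already aligned to the ground truth (monocular evaluations usually apply a similarity alignment first, which is omitted here), and images are assumed to be floats in [0, 1].

```python
import numpy as np


def ate_rmse(est_xyz, gt_xyz):
    """Absolute trajectory error (RMSE) over per-frame camera positions.

    Assumption: both trajectories are (N, 3) and already expressed in the
    same frame and scale; no alignment step is performed here.
    """
    est = np.asarray(est_xyz, dtype=float)
    gt = np.asarray(gt_xyz, dtype=float)
    per_frame_error = np.linalg.norm(est - gt, axis=1)
    return float(np.sqrt(np.mean(per_frame_error ** 2)))


def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = np.asarray(rendered, dtype=float) - np.asarray(reference, dtype=float)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline
```

Under these definitions, a 64.3% reduction in ATE RMSE means the reported error is roughly 0.357 times the baseline's, and a 2.11 dB PSNR gain is an additive difference on a logarithmic scale, not a percentage.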
Abstract
Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and…
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques
