GS4: Generalizable Sparse Splatting Semantic SLAM
Mingqi Jiang, Chanho Kim, Chen Ziwen, Li Fuxin

TL;DR
GS4 introduces a fast, efficient, and generalizable semantic SLAM system that leverages Gaussian splatting for dense 3D mapping, outperforming prior methods in speed, memory, and accuracy.
Contribution
GS4 is the first generalizable Gaussian splatting-based semantic SLAM system that runs 10x faster, uses 10x fewer Gaussians, and achieves state-of-the-art results.
Findings
Runs 10x faster than prior methods
Uses 10x fewer Gaussians
Achieves state-of-the-art semantic SLAM performance
Abstract
Traditional SLAM algorithms excel at camera tracking, but typically produce incomplete and low-resolution maps that are not tightly integrated with semantic prediction. Recent work integrates Gaussian Splatting (GS) into SLAM to enable dense, photorealistic 3D mapping, yet existing GS-based SLAM methods require per-scene optimization that is slow and consumes an excessive number of Gaussians. We present GS4, the first generalizable GS-based semantic SLAM system. Compared with prior approaches, GS4 runs 10x faster, uses 10x fewer Gaussians, and achieves state-of-the-art performance across color, depth, semantic mapping, and camera tracking. From an RGB-D video stream, GS4 incrementally builds and updates a set of 3D Gaussians using a feed-forward network. First, the Gaussian Prediction Model estimates a sparse set of Gaussian parameters from the input frame, which integrates both color and…
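To make the incremental mapping idea in the abstract concrete, here is a minimal sketch of how per-frame sparse 3D primitives could be accumulated into a global map. It is an illustration under stated assumptions, not GS4's method: the learned Gaussian Prediction Model is replaced by fixed-stride back-projection of depth pixels, only Gaussian centers and colors are kept, and the names `backproject_sparse`, `integrate_frame`, and `stride` are hypothetical.

```python
import numpy as np

def backproject_sparse(depth, color, K, stride=8):
    """Back-project every `stride`-th valid pixel to a camera-space 3D point.

    Assumption: fixed-stride sampling stands in for a learned predictor.
    depth: (H, W) metres, color: (H, W, 3), K: (3, 3) pinhole intrinsics.
    """
    h, w = depth.shape
    vs, us = np.meshgrid(np.arange(0, h, stride),
                         np.arange(0, w, stride), indexing="ij")
    z = depth[vs, us]
    valid = z > 0                      # drop pixels with missing depth
    us, vs, z = us[valid], vs[valid], z[valid]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    means = np.stack([x, y, z], axis=1)   # (N, 3) candidate Gaussian centers
    colors = color[vs, us]                # (N, 3) per-center color
    return means, colors

def integrate_frame(map_means, map_colors, depth, color, K, cam_to_world,
                    stride=8):
    """Transform one frame's sparse centers to world space and append them."""
    means_cam, colors = backproject_sparse(depth, color, K, stride)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    means_world = means_cam @ R.T + t     # apply the 4x4 camera-to-world pose
    return (np.concatenate([map_means, means_world]),
            np.concatenate([map_colors, colors]))

# Usage: start from an empty map and add frames as poses arrive.
map_means, map_colors = np.zeros((0, 3)), np.zeros((0, 3))
K = np.array([[10.0, 0.0, 8.0], [0.0, 10.0, 8.0], [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[0, 3] = 1.0                          # camera shifted 1 m along world x
depth = np.full((16, 16), 2.0)
color = np.zeros((16, 16, 3))
map_means, map_colors = integrate_frame(map_means, map_colors,
                                        depth, color, K, pose)
```

A larger `stride` yields fewer primitives per frame, which is the knob this toy version uses in place of learned sparsity. A real system would also need the refinement and pruning steps that the reviews below mention, which this sketch omits.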
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is well written and structured. The ablation study clearly shows the contribution of each module.
- GS4 removes the need for per-scene optimization by introducing a feed-forward Gaussian prediction and refinement pipeline, which simplifies deployment and enables real-time performance on standard RGB-D inputs.
- The proposed refinement and pruning strategy effectively maintains scene quality while reducing the number of Gaussians by roughly one order of magnitude compared to existing
- The terminology “monocular RGB-D” is confusing. The paper repeatedly refers to “monocular RGB-D sequences,” which is inconsistent with the common definition of a monocular system that does not include depth sensing. In fact, datasets such as TUM RGB-D use the Microsoft Kinect v1 sensor, where depth is computed via a stereo setup. For clarity and technical correctness, it is recommended to simply use the term “RGB-D system” throughout the paper.
- Tracking evaluation does not reflect GS4’s con
1. The paper presents strong innovation in reconstruction. It provides a new perspective on combining feed-forward models and SLAM, which is conceptually inspiring and technically interesting.
2. The overall structure is clear and easy to follow. The paper is well-organized and logically written, making the technical content accessible even to readers not deeply familiar with Gaussian Splatting.
3. The writing quality is fluent and professional, with clear English exposition throughout.
1. Experimental evaluation is relatively weak. The main experiments are conducted on only six test scenes from the ScanNet dataset. This makes it hard to convincingly demonstrate the model’s effectiveness and generalization. It would be more appropriate to adopt standard benchmarks commonly used for 3DGS-based SLAM systems. In addition, the experiments on NYUv2 and TUM RGB-D are labeled as zero-shot, which may not be the most suitable term for a SLAM system. The use of this term feels inconsiste
1. The paper is well-organized and easy to follow.
2. Introducing a generalizable 3D Gaussian module into SLAM is a promising idea, as it can reduce mapping time compared to other 3DGS-based SLAM systems.
3. The proposed method effectively reduces the total number of 3D Gaussians required to represent the entire scene while maintaining high rendering performance.
1. This SLAM module relies on the state-of-the-art tracking component from GO-SLAM, which may lead to an unfair comparison with other 3DGS-based SLAM methods. For instance, SplaTAM performs both tracking and mapping using its learned 3D Gaussian map. Therefore, it would be important to clarify whether other 3DGS-based methods could also achieve improved rendering performance if they adopted the same tracking module from GO-SLAM.
2. The proposed SLAM framework primarily focuses on enhancing mappi
While prior Gaussian Splatting (GS) SLAM systems (e.g., SplaTAM, SGS-SLAM) rely on per-scene optimization, GS4 replaces this with a feed-forward, generalizable network that predicts semantic Gaussians directly from RGB-D inputs. This shifts the paradigm from scene-specific reconstruction to learned, zero-shot mapping—a significant conceptual leap. Unlike prior semantic SLAM methods that pipeline segmentation models (e.g., Mask2Former) separately from geometry estimation, GS4 unifies color, dep
While the paper claims zero-shot semantic generalization to NYUv2 and TUM RGB-D, these datasets share the same indoor domain and similar class taxonomy (e.g., chairs, desks, walls) as ScanNet. There is no evaluation on domain-shifted semantics, such as outdoor scenes (e.g., KITTI with vehicle/pedestrian classes) or open-vocabulary settings (e.g., using CLIP embeddings as in OVO-SLAM). While GS4 is “10x faster” than baselines, FPS numbers (Table 4) hide critical details: Is speedup due to fewer
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · 3D Surveying and Cultural Heritage
Methods: Batch Normalization · 1x1 Convolution · Convolution · Thinned U-shape Module
