Towards Physically Executable 3D Gaussian for Embodied Navigation
Bingchen Miao, Rong Wei, Zhiqi Ge, Xiaoquan Sun, Shiqi Gao, Jingzhe Zhu, Renhan Wang, Siliang Tang, Jun Xiao, Rui Tang, Juncheng Li

TL;DR
This paper enhances 3D Gaussian Splatting with semantic and physical features to create executable environments for embodied navigation, improving realism and generalization in visual-language tasks.
Contribution
It introduces SAGE-3D, integrating semantic grounding and physics-aware execution into 3DGS, along with new datasets and benchmarks for VLN.
Findings
Improved baseline performance by 31% on the VLN-CE Unseen task.
Enhanced generalizability of 3D scene data.
Created InteriorGS and SAGE-Bench datasets.
Abstract
3D Gaussian Splatting (3DGS), a 3D representation method with photorealistic real-time rendering capabilities, is regarded as an effective tool for narrowing the sim-to-real gap. However, it lacks fine-grained semantics and physical executability for Visual-Language Navigation (VLN). To address this, we propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a new paradigm that upgrades 3DGS into an executable, semantically and physically aligned environment. It comprises two components: (1) Object-Centric Semantic Grounding, which adds object-level fine-grained annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and constructs rich physical interfaces. We release InteriorGS, containing 1K object-annotated 3DGS indoor scene data, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark with 2M…
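As a rough illustration of the two components named in the abstract, the sketch below tags each Gaussian with an object ID (standing in for object-level semantic grounding) and derives one axis-aligned bounding box per object to act as a collision proxy (standing in for physics-aware execution). Everything here is an assumption made for illustration: the names `Gaussian`, `SceneObject`, `build_objects`, and `collides`, the use of bounding boxes, and the spherical agent are hypothetical and are not taken from SAGE-3D, whose actual data format and collision geometry are not described in this summary.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Gaussian:
    center: Vec3
    object_id: int  # hypothetical per-Gaussian object label


@dataclass
class SceneObject:
    object_id: int
    category: str
    aabb_min: Vec3  # collision proxy: axis-aligned bounding box (an assumption)
    aabb_max: Vec3


def build_objects(gaussians: List[Gaussian],
                  categories: Dict[int, str]) -> Dict[int, SceneObject]:
    """Group Gaussians by object ID and fit one bounding box per object."""
    grouped: Dict[int, List[Vec3]] = {}
    for g in gaussians:
        grouped.setdefault(g.object_id, []).append(g.center)

    objects: Dict[int, SceneObject] = {}
    for oid, centers in grouped.items():
        lo = tuple(min(c[i] for c in centers) for i in range(3))
        hi = tuple(max(c[i] for c in centers) for i in range(3))
        objects[oid] = SceneObject(oid, categories.get(oid, "unknown"), lo, hi)
    return objects


def collides(position: Vec3, radius: float,
             objects: Dict[int, SceneObject]) -> List[str]:
    """Return categories of objects whose box intersects a spherical agent."""
    hits: List[str] = []
    for obj in objects.values():
        # Squared distance from the sphere center to the closest point on the box.
        d2 = 0.0
        for i in range(3):
            nearest = min(max(position[i], obj.aabb_min[i]), obj.aabb_max[i])
            d2 += (position[i] - nearest) ** 2
        if d2 <= radius ** 2:
            hits.append(obj.category)
    return hits


scene = build_objects(
    [Gaussian((0.0, 0.0, 0.0), 1), Gaussian((1.0, 1.0, 1.0), 1),
     Gaussian((5.0, 5.0, 5.0), 2)],
    {1: "table", 2: "chair"},
)
near_table = collides((1.2, 0.5, 0.5), 0.3, scene)
free_space = collides((3.0, 3.0, 3.0), 0.5, scene)
```

In this toy scene, an agent placed just beside the first object's box is reported as colliding with it, while an agent in open space reports no hits. A real pipeline would presumably use tighter collision geometry than bounding boxes fitted to Gaussian centers.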
Peer Reviews
Decision·ICLR 2026 Poster
### **Strengths**
1. The paper is well-organized and easy to understand, with clear explanations and intuitive figures illustrating the methodology.
2. The idea makes sense. 3DGS with object-level semantics and physical validity enables photorealistic rendering, semantic instance labeling, and physical interaction modeling. Based on this representation, a realistic and executable simulation environment can be created for embodied AI research, which is highly important.
3. Experiments demonstr
### **Weaknesses and Questions**
1. Creating the dataset appears to rely on detailed mesh scenes and extensive manual annotations, which may incur high costs and limit further scalability. It would be worth exploring more cost-efficient approaches for dataset construction, such as leveraging current vision-based semantic/geometry foundation models for scene reconstruction and semantic annotation.
2. The dataset seems to consist solely of static scenes. Introducing dynamic objects into the scen
Originality: This work pioneers the first-of-its-kind benchmark built on the 3D Gaussian Splatting (3DGS) representation for vision-language navigation (VLN). By repurposing and re-contextualizing 3DGS (traditionally a rendering/novel-view synthesis technique) into the embodied navigation domain, the authors introduce a novel problem formulation and open a promising new research direction.
Quality: The paper clearly presents its ideas with structured exposition and strong narrative flow. The meth
Limited reproducibility and generalisation of the paradigm: While the paper presents an interesting direction of adapting 3D Gaussian Splatting (3DGS) for embodied navigation, much of the work depends on manual annotation of objects and artist-created meshes/scene enrichment. The reliance on handcrafted assets makes it difficult for other researchers to easily replicate or scale the setup, and raises questions about how well the approach will generalise to entirely new scenes or domains without
1. The motivation of integrating semantic and physical information into 3D Gaussian Splatting is reasonable.
2. The proposed dataset is large in scale and distinguishes itself from others with diverse instructions and accurate geometry.
3. The proposed evaluation metrics align well with real application scenarios.
1. The performance on SAGE-Bench is worse than baselines in terms of CR, ICP and PS.
