RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation
Yue Chang, Rufeng Chen, Zhaofan Zhang, Yi Chen, Yifan Tian, Sihong Xie

TL;DR
This paper introduces RAG-3DSG, a novel method that improves 3D scene graph accuracy by estimating semantic uncertainty through re-shot viewpoints and using retrieval-augmented generation to refine object predictions.
Contribution
The paper presents a re-shot guided uncertainty estimation technique and a retrieval-augmented generation framework to enhance 3D scene graph construction under occlusions and viewpoint constraints.
Findings
Achieves higher recall and precision on benchmark datasets.
Effectively reduces semantic noise in 3D scene graphs.
Demonstrates improved performance in real-world robot trials.
Abstract
Open-vocabulary 3D Scene Graph (3DSG) can enhance various downstream tasks in robotics by leveraging structured semantic representations, yet current 3DSG construction methods suffer from semantic inconsistencies caused by noisy cross-image aggregation under occlusions and constrained viewpoints. To mitigate the impact of such inconsistency, we propose RAG-3DSG, which introduces re-shot guided uncertainty estimation. By measuring the semantic consistency between original limited viewpoints and re-shot optimal viewpoints, this method quantifies the underlying semantic ambiguity of each graph object. Based on this quantification, we devise an Object-level Retrieval-Augmented Generation (RAG) that leverages low-uncertainty objects as semantic anchors to retrieve more reliable contextual knowledge, enabling a Vision-Language Model to rectify the predictions of uncertain objects and optimize…
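The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes each object already has one caption embedding from its original views and one from its re-shot view, uses one minus cosine similarity as a stand-in for the paper's consistency measure, and retrieves the most similar low-uncertainty objects as anchors. All names (`SceneObject`, `semantic_uncertainty`, `retrieve_anchors`) and the 0.3 threshold are hypothetical.

```python
from dataclasses import dataclass
import math


@dataclass
class SceneObject:
    name: str
    orig_emb: list    # caption embedding from the original views (assumed given)
    reshot_emb: list  # caption embedding from the re-shot view (assumed given)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def semantic_uncertainty(obj):
    # Assumed proxy: low agreement between original and re-shot captions
    # means high uncertainty. The paper's actual measure may differ.
    return 1.0 - cosine(obj.orig_emb, obj.reshot_emb)


def retrieve_anchors(query, objects, threshold=0.3, k=2):
    # Keep only low-uncertainty objects as anchors, then rank them by
    # similarity to the query object's re-shot embedding.
    anchors = [o for o in objects
               if o is not query and semantic_uncertainty(o) < threshold]
    anchors.sort(key=lambda o: cosine(query.reshot_emb, o.reshot_emb),
                 reverse=True)
    return anchors[:k]


# Toy data: three objects with hand-written 3-D "embeddings".
objects = [
    SceneObject("sofa", [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),
    SceneObject("lamp", [0.0, 1.0, 0.0], [0.0, 1.0, 0.1]),
    SceneObject("unknown", [1.0, 0.0, 0.0], [0.0, 0.2, 1.0]),
]
uncertain = [o for o in objects if semantic_uncertainty(o) >= 0.3]
context = {o.name: [a.name for a in retrieve_anchors(o, objects)]
           for o in uncertain}
print(context)
```

In the pipeline the abstract describes, the retrieved anchors would then be passed as context to a Vision-Language Model that rewrites the uncertain object's prediction; that call is omitted here.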
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
- Under the same base voxel size, our dynamic downsample-mapping strategy shortens the processing time by nearly two-thirds (e.g., from 6.65 s/iter to 2.49 s/iter when the voxel size is set to 0.01 in Replica).
- The proposed method achieves comparable or even superior accuracy, reaching the best mAcc score (40.67) while maintaining a competitive f-mIoU (35.65). This demonstrates that our approach not only accelerates the object mapping process but also preserves segmentation quality.
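As a quick check of the "nearly two-thirds" claim, the snippet below computes the relative time reduction and the speedup from the two timings quoted above. Only the two s/iter values come from the text; the variable names are illustrative.

```python
# Per-iteration processing times quoted above (Replica, voxel size 0.01).
before = 6.65  # s/iter without the dynamic downsample-mapping strategy
after = 2.49   # s/iter with it

reduction = (before - after) / before  # fraction of time saved
speedup = before / after               # how many times faster

print(f"time reduction: {reduction:.1%}, speedup: {speedup:.2f}x")
```

The reduction comes out at roughly 62.6%, slightly under two-thirds (66.7%), which corresponds to a speedup of about 2.67x.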
1) The superiority of the developed approach over its SOTA counterparts is unconvincing. Judging by Table 2, well-known methods from ConceptGraphs and OpenMask3D outperform the presented method on a number of metrics.
2) Furthermore, the comparison with existing methods is incomplete. For example, there are other effective methods for solving the problem of open-vocabulary scene graph generation, such as Beyond Bare Queries [1].
[1] Linok, S., Zemskova, T., Ladanova, S., Titkov, R., Yudin, D.,
The authors introduce a dedicated object-level RAG construction to improve node and edge captions in 3D scene graphs. The designs are reasonable and well supported by experimental results.
1. The writing and organization are unclear. The proposed method has many components and steps, but Figure 1 and the text do not sufficiently explain the technical details; for example, the best render view in Sec. 3.2.2 is hard to understand from the text alone.
2. The authors do not provide a thorough evaluation of the effectiveness of the proposed method. The core of the proposed method is the caption refinement based on object-level RAG; the authors should provide a baseline method with designed pr
The paper accurately identifies a critical challenge in existing open-vocabulary 3DSG methods—namely, that semantic noise introduced during multi-view information aggregation, due to suboptimal viewpoints and occlusions, significantly degrades the quality of the resulting scene graphs. This is a widespread and pressing issue in real-world robotic applications. The proposed “re-shot guided uncertainty estimation” is a highly novel and elegant idea. It creatively leverages the advantages of 3D re
Limited Experimental Validation · Single dataset: All experiments are conducted exclusively on Replica—a near-noise-free, synthetic benchmark. The paper’s central claim, however, is robustness to “limited viewpoints, occlusions, and adverse imaging conditions”, which are far more severe in real-world RGB-D streams captured by commodity sensors (e.g., Kinect, RealSense). No evidence is provided on more challenging, real-world datasets such as ScanNet, Matterport3D, or data collected by the auth
- The proposed object-level RAG framework is a clever way to leverage the scene's internal structure for self-correction. Treating confident, low-uncertainty objects as a knowledge base and using them to refine the descriptions of ambiguous, high-uncertainty objects is an effective strategy.
- The core idea of using a re-shot image, rendered from an aggregated object point cloud, is a novel and intuitive approach to mitigate issues of occlusion and constrained viewpoints. By generating an image f
- The experiments are performed exclusively on the synthetic Replica dataset. While this is a standard benchmark, it consists of only a few clean, high-quality indoor environments. The method's effectiveness and robustness remain unproven on real-world datasets like ScanNet or Matterport3D. The re-shot performance on sparse or noisy data from real-world scans is a major open question.
- The performance gains reported in Table 2 are very marginal; if the computational cost is the main goal, it has to be repor
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications
