Zero-Shot Robotic Manipulation via 3D Gaussian Splatting-Enhanced Multimodal Retrieval-Augmented Generation
Zilong Xie, Jingyu Gong, Xin Tan, Zhizhong Zhang, Yuan Xie

TL;DR
This paper introduces RobMRAG, a novel framework combining 3D Gaussian Splatting and multimodal retrieval to enable zero-shot robotic manipulation with improved generalization and interpretability.
Contribution
The paper proposes a new zero-shot manipulation method integrating 3D Gaussian Splatting with multimodal retrieval and pose refinement, enhancing generalization to unseen objects.
Findings
Achieves a 7.76% higher success rate than the best zero-shot baseline.
Outperforms state-of-the-art supervised methods by 6.54%.
Effectively bridges semantic reasoning and geometric execution.
Abstract
Existing end-to-end approaches to robotic manipulation often lack generalization to unseen objects or tasks due to limited data and poor interpretability. While recent Multimodal Large Language Models (MLLMs) demonstrate strong commonsense reasoning, they struggle with the geometric and spatial understanding required for pose prediction. In this paper, we propose RobMRAG, a 3D Gaussian Splatting-Enhanced Multimodal Retrieval-Augmented Generation (MRAG) framework for zero-shot robotic manipulation. Specifically, we construct a multi-source manipulation knowledge base containing object contact frames, task completion frames, and pose parameters. During inference, a Hierarchical Multimodal Retrieval module first employs a three-priority hybrid retrieval strategy to find task-relevant object prototypes, then selects the geometrically closest reference example based on pixel-level similarity and…
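The two-stage retrieval described in the abstract can be illustrated with a rough sketch. The abstract does not say what the three priorities are, and its text is cut off before the remaining selection criteria, so everything below is an assumption made for illustration: the record fields, the three match rules (exact name, then category, then shared name tokens), and the use of mean squared pixel difference as the similarity measure. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Grayscale image as rows of pixel intensities (a stand-in for real frames).
Image = Sequence[Sequence[float]]


@dataclass
class KBEntry:
    """One knowledge-base record; field names are hypothetical."""
    object_name: str
    category: str
    contact_frame: Image      # frame at the moment of object contact
    completion_frame: Image   # frame at task completion
    pose: List[float]         # pose parameters (layout assumed)


def retrieve_prototypes(kb: List[KBEntry], name: str, category: str) -> List[KBEntry]:
    """Stage 1: apply three match rules in priority order (rules are assumed)."""
    query_tokens = set(name.lower().split())
    rules = [
        lambda e: e.object_name.lower() == name.lower(),                    # priority 1
        lambda e: e.category.lower() == category.lower(),                   # priority 2
        lambda e: bool(query_tokens & set(e.object_name.lower().split())),  # priority 3
    ]
    for rule in rules:
        hits = [e for e in kb if rule(e)]
        if hits:
            return hits  # stop at the first priority level that matches anything
    return []


def pixel_distance(a: Image, b: Image) -> float:
    """Mean squared difference between two equally sized images."""
    diffs = [(pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def select_reference(candidates: List[KBEntry], observed: Image) -> KBEntry:
    """Stage 2: pick the candidate whose contact frame is closest to the observation."""
    return min(candidates, key=lambda e: pixel_distance(e.contact_frame, observed))


def flat(value: float) -> Image:
    """Build a tiny 2x2 constant image for the demo."""
    return [[value, value], [value, value]]


kb = [
    KBEntry("red mug", "cup", flat(0.0), flat(1.0), [0.0] * 6),
    KBEntry("blue mug", "cup", flat(0.9), flat(1.0), [0.1] * 6),
    KBEntry("hammer", "tool", flat(0.5), flat(1.0), [0.2] * 6),
]

# An unseen "green mug" has no exact match, so the category rule supplies candidates.
candidates = retrieve_prototypes(kb, "green mug", "cup")
reference = select_reference(candidates, flat(0.8))
print(reference.object_name, reference.pose)
```

Stopping at the first priority level that yields any match keeps weaker rules from diluting a strong match; a real system would more likely compare learned image and text embeddings than raw pixels and name tokens.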
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Social Robot Interaction and HRI
