RAZER: Robust Accelerated Zero-Shot 3D Open-Vocabulary Panoptic Reconstruction with Spatio-Temporal Aggregation

Naman Patel; Prashanth Krishnamurthy; Farshad Khorrami

arXiv:2505.15373 · cs.CV · May 22, 2025

TL;DR

RAZER introduces a training-free, real-time 3D scene understanding system that combines geometric reconstruction with open-vocabulary semantic mapping, enabling zero-shot object recognition and natural language interaction in complex environments.

Contribution

RAZER is a zero-shot framework that integrates GPU-accelerated 3D reconstruction with vision-language models to build open-vocabulary semantic maps without any task-specific training.
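This page does not describe how RAZER fuses per-frame 2D vision-language features into its 3D map. As a rough illustration of the general idea only, the sketch keeps a confidence-weighted running sum of normalized embeddings for each 3D instance, so a single low-confidence or inconsistent 2D segment shifts that instance's feature only slightly. This is a generic aggregation technique, not the authors' method: `InstanceFeatureMap`, `observe`, `feature`, and the confidence weights are hypothetical, and it assumes each 2D segment has already been associated with a 3D instance id.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Return a unit-length copy of vec (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class InstanceFeatureMap:
    """Hypothetical per-instance store of aggregated open-vocabulary embeddings.

    Each 3D instance id accumulates a confidence-weighted sum of the
    normalized 2D embeddings observed for it across frames.
    """

    def __init__(self) -> None:
        self.sums: Dict[int, List[float]] = {}
        self.weights: Dict[int, float] = {}

    def observe(self, instance_id: int, embedding: Sequence[float],
                confidence: float = 1.0) -> None:
        # Assumption: the 2D segment was already matched to instance_id.
        if confidence <= 0:
            return  # observations with no weight are ignored
        unit = l2_normalize(embedding)
        acc = self.sums.setdefault(instance_id, [0.0] * len(unit))
        for i, x in enumerate(unit):
            acc[i] += confidence * x
        self.weights[instance_id] = self.weights.get(instance_id, 0.0) + confidence

    def feature(self, instance_id: int) -> List[float]:
        # Direction of the weighted mean (dividing by the total weight would
        # not change it), renormalized for cosine-similarity comparisons.
        return l2_normalize(self.sums[instance_id])

    def total_weight(self, instance_id: int) -> float:
        # How much evidence has been accumulated for this instance.
        return self.weights.get(instance_id, 0.0)


fmap = InstanceFeatureMap()
fmap.observe(7, [1.0, 0.0], confidence=0.9)  # confident view of instance 7
fmap.observe(7, [0.0, 1.0], confidence=0.1)  # inconsistent, low-confidence view
print([round(x, 3) for x in fmap.feature(7)])  # prints [0.994, 0.11]
```

Weighting by a per-segment confidence is one simple way to damp flickering 2D segmentations over time; a real system could equally use visibility, mask quality, or viewing distance as the weight.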

Findings

01. Achieves superior zero-shot 3D instance retrieval and segmentation.

02. Handles 2D segmentation inconsistencies robustly.

03. Supports real-time natural language queries in 3D environments.
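Natural language queries over an open-vocabulary map are commonly answered by comparing a text embedding against each instance's aggregated visual embedding. The sketch ranks instances by cosine similarity under that assumption; it is not taken from the paper. `rank_instances`, `top_k`, and the toy 3-D vectors are hypothetical stand-ins for real vision-language model features, and the query vector is assumed to come from the text encoder of the same model that produced the instance features.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def rank_instances(
    query_embedding: Sequence[float],
    instance_features: Dict[int, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return the top_k (instance_id, score) pairs, best match first."""
    scored = [
        (instance_id, cosine_similarity(query_embedding, feat))
        for instance_id, feat in instance_features.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for aggregated per-instance VLM features.
features = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0], 3: [0.7, 0.7, 0.0]}
query = [0.9, 0.1, 0.0]  # hypothetical text embedding for a query such as "a chair"
print(rank_instances(query, features, top_k=2))  # instance 1 first, then 3
```

The Python loop keeps the example dependency-free; for real-time use over many instances, the same scores would typically be computed as one matrix-vector product on the GPU.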

Abstract

Mapping and understanding complex 3D environments is fundamental to how autonomous systems perceive and interact with the physical world, requiring both precise geometric reconstruction and rich semantic comprehension. While existing 3D semantic mapping systems excel at reconstructing and identifying predefined object instances, they lack the flexibility to efficiently build open-vocabulary semantic maps during online operation. Although recent vision-language models have enabled open-vocabulary object recognition in 2D images, they have not yet bridged the gap to 3D spatial understanding. The critical challenge lies in developing a unified, training-free system that can construct accurate 3D maps while maintaining semantic consistency and supporting natural language interaction in real time. In this paper, we develop a zero-shot framework that seamlessly integrates…


Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Image and Video Retrieval Techniques