VIZOR: Viewpoint-Invariant Zero-Shot Scene Graph Generation for 3D Scene Reasoning
Vivek Madhavaram, Vartika Sengar, Arkadipta De, Charu Sharma

TL;DR
VIZOR is a training-free, end-to-end framework that generates viewpoint-invariant 3D scene graphs with open-vocabulary relationships, improving generalization and accuracy in scene understanding and reasoning tasks.
Contribution
VIZOR introduces a novel zero-shot, viewpoint-invariant scene graph generation method directly from raw 3D data, without requiring training or annotated relationships.
Findings
Outperforms state-of-the-art methods in scene graph generation
Achieves 22% and 4.81% improvements in zero-shot grounding accuracy on two datasets
Provides consistent spatial relationships regardless of viewpoint
Abstract
Scene understanding and reasoning has been a fundamental problem in 3D computer vision, requiring models to identify objects, their properties, and the spatial or comparative relationships among them. Existing approaches enable this by creating scene graphs from multiple inputs such as 2D images, depth maps, object labels, and annotated relationships from a specific reference view. However, these methods often struggle with generalization and produce inaccurate spatial relationships like "left/right", which become inconsistent across different viewpoints. To address these limitations, we propose Viewpoint-Invariant Zero-shot scene graph generation for 3D scene Reasoning (VIZOR). VIZOR is a training-free, end-to-end framework that constructs dense, viewpoint-invariant 3D scene graphs directly from raw 3D scenes. The generated scene graph is unambiguous, as spatial relationships are…
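The inconsistency of "left/right" across viewpoints that the abstract describes can be illustrated with a small sketch. This is not VIZOR's method. It only shows the problem with camera-relative labels, under assumed conventions: a top-down 2D layout, a camera looking toward the midpoint of the two objects, and a hypothetical helper named `camera_relative_side`. The object names and coordinates are made up for illustration.

```python
from typing import Tuple

Point = Tuple[float, float]  # (x, y) on the floor plane, seen from above


def camera_relative_side(anchor: Point, target: Point, camera: Point) -> str:
    """Label `target` as left/right of `anchor` as seen from `camera`.

    Hypothetical helper for illustration only; assumes the camera looks
    at the midpoint between the two objects.
    """
    mid = ((anchor[0] + target[0]) / 2, (anchor[1] + target[1]) / 2)
    # Viewing direction from the camera toward the pair of objects.
    dx, dy = mid[0] - camera[0], mid[1] - camera[1]
    # Viewer's right-hand direction: the view direction rotated 90 degrees
    # clockwise when seen from above.
    right = (dy, -dx)
    offset = (target[0] - anchor[0], target[1] - anchor[1])
    side = offset[0] * right[0] + offset[1] * right[1]
    if side > 0:
        return "right"
    if side < 0:
        return "left"
    return "aligned"


chair: Point = (0.0, 0.0)
table: Point = (1.0, 0.0)

# Same two objects, two opposite viewpoints: the label flips.
print(camera_relative_side(chair, table, camera=(0.5, -5.0)))  # right
print(camera_relative_side(chair, table, camera=(0.5, 5.0)))   # left
```

The objects never move, yet the relationship label changes with the camera position, which is why a relationship annotated from one reference view does not transfer to another.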
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
