Multiview Scene Graph
Juexiao Zhang, Gao Zhu, Sihang Li, Xinhao Liu, Haorui Song, Xinran Tang, Chen Feng

TL;DR
This paper introduces Multiview Scene Graphs (MSG), a topological scene representation built from unposed images, along with a new dataset, evaluation metrics, and a baseline method demonstrating improved performance in scene understanding tasks.
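To make the representation concrete, here is a minimal Python sketch of a graph with place and object nodes. The class name, the field names, and the choice of two edge types (place-place and place-object) are assumptions made for illustration; the summary only states that place and object nodes are interconnected, and this is not the authors' data format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MultiviewSceneGraph:
    """Hypothetical container for a topological scene graph (illustrative only)."""

    places: set[str] = field(default_factory=set)    # e.g. one node per image
    objects: set[str] = field(default_factory=set)   # one node per physical object
    place_place: set[frozenset[str]] = field(default_factory=set)   # undirected edges
    place_object: set[tuple[str, str]] = field(default_factory=set)  # (place, object)

    def add_place_edge(self, a: str, b: str) -> None:
        """Connect two distinct place nodes, creating them if needed."""
        if a == b:
            return  # ignore self-loops
        self.places.update((a, b))
        self.place_place.add(frozenset((a, b)))

    def add_observation(self, place: str, obj: str) -> None:
        """Record that an object is associated with a place node."""
        self.places.add(place)
        self.objects.add(obj)
        self.place_object.add((place, obj))

    def objects_seen_from(self, place: str) -> set[str]:
        """Return all object nodes linked to the given place node."""
        return {o for p, o in self.place_object if p == place}


graph = MultiviewSceneGraph()
graph.add_place_edge("img_000", "img_001")
graph.add_observation("img_000", "chair_1")
graph.add_observation("img_001", "chair_1")  # same object, different view
```

In this sketch, object association across views amounts to attaching the same object node to several place nodes.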
Contribution
The work presents the first MSG dataset, a novel evaluation metric, and a Transformer-based baseline method for constructing scene graphs from unposed images.
Findings
The proposed MSG dataset enables evaluation of scene graph construction.
A new metric scores MSG edges using intersection-over-union.
The baseline method outperforms existing approaches.
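As a rough illustration of an intersection-over-union score on graph edges, the sketch below compares a predicted edge set against a ground-truth edge set. It assumes undirected edges and that predicted and ground-truth node identities have already been matched; the function name `edge_iou` is hypothetical, and the paper's exact metric definition may differ.

```python
from typing import Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def edge_iou(predicted: Iterable[Edge], ground_truth: Iterable[Edge]) -> float:
    """Intersection-over-union between two undirected edge sets.

    Assumption: both graphs use the same node identifiers, so an edge
    counts as shared when it joins the same pair of nodes in both.
    """
    # frozenset makes (a, b) and (b, a) compare equal.
    pred = {frozenset(e) for e in predicted}
    gt = {frozenset(e) for e in ground_truth}
    union = pred | gt
    if not union:
        return 1.0  # two empty graphs agree trivially (a convention chosen here)
    return len(pred & gt) / len(union)


predicted_edges = [("p0", "p1"), ("p1", "p2")]
true_edges = [("p1", "p0"), ("p2", "p3")]
score = edge_iou(predicted_edges, true_edges)  # 1 shared edge out of 3 distinct edges
```

Treating edges as unordered pairs keeps the score independent of the order in which the two endpoints are listed.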
Abstract
A proper scene representation is central to the pursuit of spatial intelligence where agents can robustly reconstruct and efficiently understand 3D scenes. A scene representation is either metric, such as landmark maps in 3D reconstruction, 3D bounding boxes in object detection, or voxel grids in occupancy prediction, or topological, such as pose graphs with loop closures in SLAM or visibility graphs in SfM. In this work, we propose to build Multiview Scene Graphs (MSG) from unposed images, representing a scene topologically with interconnected place and object nodes. The task of building MSG is challenging for existing representation learning methods since it needs to jointly address visual place recognition, object detection, and object association from images with limited fields of view and potentially large viewpoint changes. To evaluate any method tackling this task, we…
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Artificial Intelligence in Games
Methods: Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Attention Is All You Need · Linear Layer
