Dense Object Grounding in 3D Scenes
Wencan Huang, Daizong Liu, Wei Hu

TL;DR
This paper introduces 3D Dense Object Grounding (3D DOG), the task of localizing multiple objects in a 3D scene from a paragraph-level description, and proposes a transformer-based framework that improves accuracy over existing single-object grounding methods.
Contribution
The paper proposes a new, challenging task, 3D DOG, and a Stacked Transformer framework, 3DOGSFormer, that uses contextual relationships among the described objects to localize multiple objects in 3D scenes more accurately.
Findings
Outperforms state-of-the-art single-object grounding methods.
Achieves significant improvements on Nr3D, Sr3D, and ScanRefer benchmarks.
Effectively models semantic and spatial relationships among densely referred objects.
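The last finding, modeling relationships among the objects described in one paragraph, can be illustrated with a minimal sketch. This is not the 3DOGSFormer architecture, whose details are not given here. It assumes one hypothetical feature vector per sentence-described object and applies plain single-head scaled dot-product self-attention with no learned projections, so that each object's representation is updated with context from the others instead of being processed independently. The names joint_self_attention and sentence_feats are illustrative only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_self_attention(queries):
    """Single-head scaled dot-product self-attention (illustrative only).

    queries: array of shape (num_sentences, dim), one hypothetical feature
             vector per object described in the paragraph.
    Returns the context-updated features and the attention weights.
    """
    dim = queries.shape[-1]
    # Pairwise similarity between every pair of described objects.
    scores = queries @ queries.T / np.sqrt(dim)
    # Each row is a distribution over all objects in the paragraph.
    weights = softmax(scores, axis=-1)
    # Every object's feature becomes a weighted mix of all objects' features.
    return weights @ queries, weights


# Assumed toy input: a paragraph with 3 sentences and 8-dimensional features.
rng = np.random.default_rng(0)
sentence_feats = rng.normal(size=(3, 8))
updated, weights = joint_self_attention(sentence_feats)
```

In a real model the queries, keys, and values would come from learned linear projections and several heads would run in parallel. The sketch only shows the contrast with per-sentence independent grounding, where the attention matrix would effectively be the identity.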
Abstract
Localizing objects in 3D scenes according to the semantics of a given natural language description is a fundamental and important task in the field of multimedia understanding, which benefits various real-world applications such as robotics and autonomous driving. However, the majority of existing 3D object grounding methods are restricted to a single-sentence input describing an individual object, and so cannot comprehend or reason about more contextualized descriptions of multiple objects in more practical 3D cases. To this end, we introduce a new challenging task, called 3D Dense Object Grounding (3D DOG), to jointly localize multiple objects described in a more complicated paragraph rather than a single sentence. Instead of naively localizing each sentence-guided object independently, we found that dense objects described in the same paragraph are often semantically related and spatially located in a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Byte Pair Encoding · Softmax · Dropout · Label Smoothing · Absolute Position Encodings
