Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding
Sombit Dey, Ozan Unal, Christos Sakaridis, Luc Van Gool

TL;DR
This paper introduces two novel loss functions and a new architecture for 3D visual grounding. Together they improve the modeling of spatial relations between instances and of the word-level structure of the description, yielding competitive results on the ReferIt3D benchmark.
Contribution
It proposes visual and language span losses along with a bidirectional fusion module to enhance 3D visual grounding performance.
Findings
Achieves competitive results on the ReferIt3D benchmark.
Demonstrates improved modeling of spatial relations.
Enhances verbal embedding learning through new losses.
Abstract
3D visual grounding consists of identifying the instance in a 3D scene which is referred to by an accompanying language description. While several architectures have been proposed within the commonly employed grounding-by-selection framework, the utilized losses are comparatively under-explored. In particular, most methods rely on a basic supervised cross-entropy loss on the predicted distribution over candidate instances, which fails to model both spatial relations between instances and the internal fine-grained word-level structure of the verbal referral. Sparse attempts to additionally supervise verbal embeddings globally by learning the class of the referred instance from the description or employing verbo-visual contrast to better separate instance embeddings do not fundamentally lift the aforementioned limitations. Responding to these shortcomings, we introduce two novel losses for…
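As a rough illustration of the baseline the abstract criticizes, the sketch below computes a cross-entropy loss over candidate instances in the grounding-by-selection setting. It is not the paper's code: the function name `selection_cross_entropy`, the example scores, and the use of NumPy are all assumptions made for illustration. The sketch also does not implement the paper's proposed losses, which are cut off in the abstract above.

```python
import numpy as np

def selection_cross_entropy(scores, target_index):
    """Cross-entropy over candidate instances for a single description.

    scores: one matching score (logit) per candidate instance in the scene
            (hypothetical output of some grounding model).
    target_index: index of the instance the description actually refers to.
    """
    scores = np.asarray(scores, dtype=np.float64)
    # Subtract the maximum before exponentiating for numerical stability.
    shifted = scores - scores.max()
    # Log-softmax gives the log of the predicted distribution over candidates.
    log_probs = shifted - np.log(np.exp(shifted).sum())
    # The loss is the negative log-probability of the referred instance.
    return float(-log_probs[target_index])

# Hypothetical scores for four candidate instances; instance 0 is the target.
scores = [2.0, 0.5, -1.0, 0.1]
loss = selection_cross_entropy(scores, target_index=0)
predicted = int(np.argmax(scores))  # grounding-by-selection picks the top-scoring candidate
```

Note that this loss only looks at the probability assigned to the referred instance. It carries no explicit signal about where the other candidates lie relative to the target or about which words in the description name it, which is the gap the abstract points to.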
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Advanced Vision and Imaging · Advanced Neural Network Applications
