Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention

Haomeng Zhang; Chiao-An Yang; Raymond A. Yeh

arXiv:2410.22306·cs.CV·December 23, 2024

TL;DR

This paper presents D-LISA, a novel two-stage method for multi-object 3D grounding that leverages dynamic modules and language-informed spatial attention to improve localization accuracy in point clouds.

Contribution

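The paper introduces D-LISA, which combines three components: a dynamic vision module that produces a variable, learnable number of box proposals; dynamic camera positioning that extracts features for each proposal; and a language-informed spatial attention module that reasons over the proposals to output the final prediction.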

Findings

01 Outperforms state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute)

02 Achieves competitive results in single-object 3D grounding

03 Demonstrates the effectiveness of the dynamic modules and language-informed spatial attention

Abstract

Multi-object 3D Grounding involves locating 3D boxes based on a given query phrase from a point cloud. It is a challenging and significant task with numerous applications in visual understanding, human-computer interaction, and robotics. To tackle this challenge, we introduce D-LISA, a two-stage approach incorporating three innovations. First, a dynamic vision module that enables a variable and learnable number of box proposals. Second, a dynamic camera positioning that extracts features for each proposal. Third, a language-informed spatial attention module that better reasons over the proposals to output the final prediction. Empirically, experiments show that our method outperforms the state-of-the-art methods on multi-object 3D grounding by 12.8% (absolute) and is competitive in single-object 3D grounding.
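To make the third idea in the abstract more concrete, here is a minimal sketch of attention over box proposals in which a language-derived value controls how strongly spatial distance influences the attention weights. This is not the authors' implementation: the function names, the scalar `lang_gate` (standing in for whatever signal a sentence encoder would supply), and the distance-penalty form of the bias are all assumptions made purely for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention_weights(features, centers, lang_gate):
    """Attention weights between proposals (hypothetical formulation).

    features:  one feature vector per proposal
    centers:   one 3D box center per proposal
    lang_gate: assumed scalar derived from the query; 0 ignores geometry,
               larger values penalise attention to distant proposals more
    """
    d = len(features[0])
    weights = []
    for f_i, c_i in zip(features, centers):
        logits = []
        for f_j, c_j in zip(features, centers):
            # Scaled dot-product content term
            content = sum(a * b for a, b in zip(f_i, f_j)) / math.sqrt(d)
            # Spatial term: Euclidean distance between box centers
            distance = math.dist(c_i, c_j)
            logits.append(content - lang_gate * distance)
        weights.append(softmax(logits))
    return weights


def attend(features, weights):
    """Weighted sum of proposal features using the attention weights."""
    d = len(features[0])
    return [
        [sum(w * f[k] for w, f in zip(row, features)) for k in range(d)]
        for row in weights
    ]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    ctrs = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
    for gate in (0.0, 2.0):
        w = spatial_attention_weights(feats, ctrs, gate)
        print(gate, [round(x, 3) for x in w[0]])
```

With `lang_gate = 0` the three identical proposals receive equal attention; raising the gate shifts weight toward spatially nearby proposals, which is the kind of behaviour a query such as "the chair next to the table" would call for.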

Code & Models

Repositories

haomengz/D-LISA (PyTorch, official)

Videos

Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Constraint Satisfaction and Optimization

Methods: Softmax · Attention Is All You Need · Max Pooling · Convolution · Sigmoid Activation · Average Pooling