PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding

Chenshu Hou; Liang Peng; Xiaopei Wu; Xiaofei He; Wenxiao Wang

arXiv:2407.14491 · cs.CV · September 4, 2024

Open Access

TL;DR

This paper introduces PD-APE, a dual-branch decoding framework with adaptive position encoding for 3D visual grounding, effectively separating target object and environment understanding to improve accuracy.

Contribution

The proposed PD-APE framework uniquely decouples object and environment decoding with adaptive position encoding, enhancing focus and performance in 3D visual grounding tasks.

Findings

01 Outperforms prior state-of-the-art methods on the ScanRefer dataset
02 Achieves superior results on the Nr3D dataset
03 Demonstrates effective separation of object and environment attention

Abstract

3D visual grounding aims to identify objects in 3D point cloud scenes that match specific natural language descriptions. This requires the model to not only focus on the target object itself but also to consider the surrounding environment to determine whether the descriptions are met. Most previous works attempt to accomplish both tasks within the same module, which can easily lead to a distraction of attention. To this end, we propose PD-APE, a dual-branch decoding framework that separately decodes target object attributes and surrounding layouts. Specifically, in the target object branch, the decoder processes text tokens that describe features of the target object (e.g., category and color), guiding the queries to pay attention to the target object itself. In the surrounding branch, the queries align with other text tokens that carry surrounding environment information, making the…
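
The dual-branch idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cross_attention`, `dual_branch_decode`), the boolean `is_object_token` mask, the 2-D toy embeddings, and the additive fusion of the two branch outputs are all assumptions made for illustration. The abstract is truncated before it says how the branches are combined or how the adaptive position encoding works, so neither is modeled here. The sketch only shows each query attending separately to target-object tokens and to surrounding-environment tokens.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, tokens):
    """Scaled dot-product attention of one query vector over token vectors.

    Simplification: no learned projections, single head.
    """
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    weights = softmax(scores)
    out = [sum(w * tok[i] for w, tok in zip(weights, tokens))
           for i in range(d)]
    return out, weights

def dual_branch_decode(queries, text_tokens, is_object_token):
    """Decode each query against two disjoint subsets of the text tokens.

    is_object_token is a hypothetical mask: True for tokens describing the
    target object, False for tokens describing its surroundings.
    """
    obj_tokens = [t for t, f in zip(text_tokens, is_object_token) if f]
    env_tokens = [t for t, f in zip(text_tokens, is_object_token) if not f]
    fused, obj_weights, env_weights = [], [], []
    for q in queries:
        # The two branches are independent, so they could run in parallel.
        obj_out, obj_w = cross_attention(q, obj_tokens)
        env_out, env_w = cross_attention(q, env_tokens)
        # Assumption: residual sum as the fusion step (not from the paper).
        fused.append([a + b + c for a, b, c in zip(q, obj_out, env_out)])
        obj_weights.append(obj_w)
        env_weights.append(env_w)
    return fused, obj_weights, env_weights

# Toy description: "brown chair next to window", with made-up embeddings.
text_tokens = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.3, 0.7]]
is_object_token = [True, True, False, False, False]
queries = [[0.5, 0.5], [1.0, -1.0]]

fused, obj_weights, env_weights = dual_branch_decode(
    queries, text_tokens, is_object_token)
```

Because each branch normalizes its attention weights over its own token subset, object tokens and environment tokens never compete for the same attention mass, which is one plausible reading of how the separation avoids the distraction the abstract describes.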

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization · Robotics and Automated Systems

Methods: Softmax · Attention Is All You Need · Focus · ALIGN