TL;DR
DVPE introduces a divided view position embedding method for multi-view 3D object detection that balances receptive field expansion against interference reduction and incorporates temporal information, reaching 57.2% mAP and 64.5% NDS on nuScenes.
Contribution
The paper proposes a novel divided view position embedding approach that decouples position encoding from camera poses and integrates temporal features, improving multi-view 3D detection performance.
Findings
Achieves 57.2% mAP and 64.5% NDS on nuScenes
Reduces interference in multi-view feature aggregation
Enhances training stability with a one-to-many assignment strategy
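For readers unfamiliar with the second metric: the nuScenes benchmark defines NDS as a weighted combination of mAP and five mean true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE), each clipped to [0, 1] and turned into a score. The sketch below restates that formula in Python. The function names are illustrative and not taken from the paper or the nuScenes devkit, and this summary does not report the individual error values, so none are assumed here.

```python
def nds(mean_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors.

    tp_errors: iterable of five error values (mATE, mASE, mAOE, mAVE, mAAE).
    Each error is clipped to 1 and converted to a score as 1 - error.
    """
    tp_errors = list(tp_errors)
    if len(tp_errors) != 5:
        raise ValueError("expected exactly five TP error values")
    tp_scores = [1.0 - min(1.0, e) for e in tp_errors]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


def implied_tp_score_sum(nds_value, mean_ap):
    """Invert the formula: total of the five TP scores implied by NDS and mAP."""
    return 10.0 * nds_value - 5.0 * mean_ap
```

Applied to the reported 64.5% NDS and 57.2% mAP, the inverse gives a TP score total of about 3.59 out of 5, or roughly 0.718 per metric on average. How that total splits across the five error types is not stated in this summary.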
Abstract
Sparse query-based paradigms have achieved significant success in multi-view 3D detection for autonomous vehicles. Current research faces challenges in balancing between enlarging receptive fields and reducing interference when aggregating multi-view features. Moreover, different poses of cameras present challenges in training global attention models. To address these problems, this paper proposes a divided view method, in which features are modeled globally via the visibility cross-attention mechanism, but interact only with partial features in a divided local virtual space. This effectively reduces interference from other irrelevant features and alleviates the training difficulties of the transformer by decoupling the position embedding from camera poses. Additionally, 2D historical RoI features are incorporated into the object-centric temporal modeling to utilize high-level visual…
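The idea of attending globally while interacting only with a visible subset of features can be illustrated with masked softmax cross-attention. The NumPy sketch below is a generic illustration, not the paper's implementation. The function name, the `visible` mask, and the rule used in the usage note to build it (a query sees a key only when both carry the same space id) are assumptions made for the example. The abstract does not spell out how DVPE decides visibility or constructs its virtual spaces.

```python
import numpy as np


def visibility_cross_attention(queries, keys, values, visible):
    """Softmax cross-attention restricted by a boolean visibility mask.

    queries: (Q, D) array of object queries.
    keys, values: (K, D) arrays of image features.
    visible: (Q, K) boolean array; True where a query may attend to a key.
    Returns the attended features (Q, D) and the attention weights (Q, K).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)

    # Masked-out pairs get -inf so they receive zero weight after softmax.
    scores = np.where(visible, scores, -np.inf)

    # Queries with no visible key would produce an all -inf row; reset those
    # rows to zeros so the max-subtraction below stays finite.
    has_any = visible.any(axis=1, keepdims=True)
    scores = np.where(has_any, scores, 0.0)

    # Numerically stable softmax over the visible keys only.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.where(visible, np.exp(scores), 0.0)
    denom = weights.sum(axis=1, keepdims=True)
    weights = np.divide(weights, denom, out=np.zeros_like(weights), where=denom > 0)

    return weights @ values, weights
```

As a usage example under the same assumption, with hypothetical id arrays `query_space` of shape (Q,) and `key_space` of shape (K,), the mask can be built as `visible = query_space[:, None] == key_space[None, :]`. Each query then aggregates only features from its own space, and a query with no visible features returns a zero vector.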
Taxonomy
Methods: Softmax · Attention Is All You Need
