DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting
Hongyang Li, Hao Zhang, Zhaoyang Zeng, Shilong Liu, Feng Li, Tianhe Ren, and Lei Zhang

TL;DR
This paper introduces DFA3D, a novel 3D deformable attention operator that enhances 2D-to-3D feature lifting for improved 3D object detection, effectively addressing depth ambiguity and refining features through a Transformer-like architecture.
Contribution
We propose DFA3D, a new operator for 2D-to-3D feature lifting that alleviates depth ambiguity and refines features iteratively, with a memory-efficient implementation and demonstrated improvements on nuScenes.
Findings
+1.41% mAP improvement on nuScenes
Up to +15.1% mAP with high-quality depth
Effective alleviation of depth ambiguity
Abstract
In this paper, we propose a new operator, called 3D DeFormable Attention (DFA3D), for 2D-to-3D feature lifting, which transforms multi-view 2D image features into a unified 3D space for 3D object detection. Existing feature lifting approaches, such as Lift-Splat-based and 2D attention-based, either use estimated depth to get pseudo LiDAR features and then splat them to a 3D space, which is a one-pass operation without feature refinement, or ignore depth and lift features by 2D attention mechanisms, which achieve finer semantics while suffering from a depth ambiguity problem. In contrast, our DFA3D-based method first leverages the estimated depth to expand each view's 2D feature map to 3D and then utilizes DFA3D to aggregate features from the expanded 3D feature maps. With the help of DFA3D, the depth ambiguity problem can be effectively alleviated from the root, and the lifted features…
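The two steps the abstract describes (expanding each view's 2D feature map to 3D with estimated depth, then sampling features from the expanded map) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the estimated depth is a per-pixel probability distribution over discrete depth bins, which the abstract does not specify, and the function names `expand_to_3d`, `trilinear_sample`, and `depth_weighted_sample` are hypothetical. The last function shows that, under this outer-product assumption, the same sample can be computed without materializing the 3D volume; whether this matches the memory-efficient implementation mentioned in the Contribution is not claimed here.

```python
import numpy as np

def expand_to_3d(feat, depth_prob):
    """Outer product of 2D features (C, H, W) and depth bins (D, H, W) -> (C, D, H, W).

    Assumption: depth is a per-pixel distribution over D discrete bins.
    """
    return feat[:, None, :, :] * depth_prob[None, :, :, :]

def _corners(x, size):
    """Two neighbouring integer indices around x and their linear weights."""
    x0 = int(np.clip(np.floor(x), 0, size - 2))
    t = x - x0
    return ((x0, 1.0 - t), (x0 + 1, t))

def trilinear_sample(volume, u, v, d):
    """Trilinearly interpolate a (C, D, H, W) volume at in-bounds point (u, v, d)."""
    C, D, H, W = volume.shape
    out = np.zeros(C)
    for di, wd in _corners(d, D):
        for vi, wv in _corners(v, H):
            for ui, wu in _corners(u, W):
                out += wd * wv * wu * volume[:, di, vi, ui]
    return out

def depth_weighted_sample(feat, depth_prob, u, v, d):
    """Same result as trilinear_sample on the expanded volume, without building it.

    Each of the 4 spatial corners contributes its 2D feature, scaled by the
    depth probability linearly interpolated along the depth axis at that corner.
    """
    C, H, W = feat.shape
    D = depth_prob.shape[0]
    out = np.zeros(C)
    for vi, wv in _corners(v, H):
        for ui, wu in _corners(u, W):
            depth_w = sum(wd * depth_prob[di, vi, ui] for di, wd in _corners(d, D))
            out += wv * wu * depth_w * feat[:, vi, ui]
    return out

# Toy usage with random data (shapes are illustrative only).
rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 5, 6))                      # C=4, H=5, W=6
logits = rng.normal(size=(8, 5, 6))                    # D=8 depth bins
depth_prob = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

volume = expand_to_3d(feat, depth_prob)
dense = trilinear_sample(volume, u=2.3, v=1.7, d=4.6)
lean = depth_weighted_sample(feat, depth_prob, u=2.3, v=1.7, d=4.6)
print(np.allclose(dense, lean))
```

In a full attention layer, several such sampled features per 3D query would be combined with learned attention weights; that part is omitted here because the abstract does not give its details.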
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Vision and Imaging · Human Pose and Action Recognition
