Attention-Based Depth Distillation with 3D-Aware Positional Encoding for Monocular 3D Object Detection
Zizhang Wu, Yunzhe Wu, Jian Pu, Xianzhi Li, Xiaoquan Wang

TL;DR
This paper introduces ADD, an attention-based depth knowledge distillation framework with 3D-aware positional encoding that improves monocular 3D object detection accuracy at no extra inference cost.
Contribution
The paper proposes a new knowledge distillation method using a teacher with ground-truth depth, featuring 3D-aware self- and cross-attention modules for better 3D feature learning.
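The idea of distilling from a depth-aware teacher through attention can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the pinhole back-projection, the sinusoidal encoding of 3D points, the single-head cross-attention, and the mean-squared-error loss are all assumptions chosen for illustration, and every function name and parameter below is hypothetical.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to (H*W, 3) camera-frame points (pinhole model, assumed)."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def sinusoidal_encoding(points, dim):
    """Encode (N, 3) points as (N, dim) sin/cos features; dim must be divisible by 6."""
    n_freq = dim // 6
    freqs = 2.0 ** np.arange(n_freq)                 # (n_freq,)
    angles = points[:, :, None] * freqs              # (N, 3, n_freq)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(points.shape[0], -1)          # (N, dim)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value):
    """Single-head scaled dot-product attention."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ value

def distillation_loss(student_feat, teacher_feat, est_depth, gt_depth, intrinsics):
    """Hypothetical loss: student queries attend to teacher features, with both sides
    tagged by a 3D positional encoding (estimated depth for the student, ground-truth
    depth for the teacher). The attended output is regressed onto the teacher features."""
    dim = student_feat.shape[-1]
    pe_student = sinusoidal_encoding(backproject(est_depth, *intrinsics), dim)
    pe_teacher = sinusoidal_encoding(backproject(gt_depth, *intrinsics), dim)
    attended = cross_attention(student_feat + pe_student,
                               teacher_feat + pe_teacher,
                               teacher_feat)
    return float(np.mean((attended - teacher_feat) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h, w, dim = 4, 6, 12
    gt_depth = rng.uniform(5.0, 50.0, size=(h, w))
    est_depth = gt_depth * rng.uniform(0.9, 1.1, size=(h, w))  # noisy estimate
    student = rng.normal(size=(h * w, dim))
    teacher = rng.normal(size=(h * w, dim))
    print(distillation_loss(student, teacher, est_depth, gt_depth,
                            (700.0, 700.0, w / 2, h / 2)))
```

Because the loss in this sketch is used only during training, the teacher branch and the attention modules can be discarded afterwards, which is consistent with the claim that inference cost matches the baseline.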
Findings
Achieves state-of-the-art results on the KITTI benchmark
No additional inference cost over baseline models
Effective across multiple monocular detectors
Abstract
Monocular 3D object detection is a low-cost but challenging task, as it requires generating accurate 3D localization solely from a single image input. Recently developed depth-assisted methods show promising results by using explicit depth maps as intermediate features, which are either precomputed by monocular depth estimation networks or jointly evaluated with 3D object detection. However, inevitable errors from estimated depth priors may lead to misaligned semantic information and 3D localization, hence resulting in feature smearing and suboptimal predictions. To mitigate this issue, we propose ADD, an Attention-based Depth knowledge Distillation framework with 3D-aware positional encoding. Unlike previous knowledge distillation frameworks that adopt stereo- or LiDAR-based teachers, we build our teacher with the same architecture as the student but with extra ground-truth depth as…
Taxonomy
Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Industrial Vision Systems and Defect Detection
Methods: Knowledge Distillation
