MonoPGC: Monocular 3D Object Detection with Pixel Geometry Contexts
Zizhang Wu, Yuanzhu Gan, Lei Wang, Guilian Chen, Jian Pu

TL;DR
MonoPGC is a monocular 3D object detection framework for autonomous driving that enriches visual features with pixel geometry contexts, using auxiliary pixel depth estimation and depth-aware attention modules.
Contribution
The paper proposes an end-to-end framework that combines a depth cross-attention pyramid module (DCPM), a depth-space-aware transformer (DSAT), and a depth-gradient positional encoding for monocular 3D detection.
Findings
Achieves state-of-the-art performance on the KITTI dataset.
Effectively integrates pixel geometry with depth information.
Improves detection accuracy with novel attention modules.
Abstract
Monocular 3D object detection is an economical but challenging task in autonomous driving. Recently, center-based monocular methods have developed rapidly, offering a good trade-off between speed and accuracy; they usually depend on estimating the object center's depth from 2D features. However, visual semantic features that lack sufficient pixel geometry information may weaken the cues available for spatial 3D detection tasks. To alleviate this, we propose MonoPGC, a novel end-to-end Monocular 3D object detection framework with rich Pixel Geometry Contexts. We introduce pixel depth estimation as an auxiliary task and design a depth cross-attention pyramid module (DCPM) to inject local and global depth geometry knowledge into visual features. In addition, we present the depth-space-aware transformer (DSAT) to integrate 3D space position and depth-aware features…
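To make the idea of injecting depth knowledge into visual features more concrete, here is a minimal sketch of generic scaled dot-product cross-attention, where visual tokens act as queries and depth tokens act as keys and values. This is not the paper's DCPM: the function name `depth_cross_attention`, the projection matrices, the residual connection, and all shapes are assumptions chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def depth_cross_attention(visual, depth, w_q, w_k, w_v):
    """Hypothetical helper: generic cross-attention, not the paper's module.

    visual: (N, C) flattened visual feature tokens, used as queries.
    depth:  (M, C) flattened depth feature tokens, used as keys and values.
    Returns the fused features (N, C) and the attention map (N, M).
    """
    q = visual @ w_q
    k = depth @ w_k
    v = depth @ w_v
    # Scaled dot-product scores between every visual and depth token.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    # Residual connection (an assumption) keeps the original visual features.
    fused = visual + attn @ v
    return fused, attn


# Toy example: a 4x6 feature map with 8 channels, flattened to 24 tokens.
rng = np.random.default_rng(0)
H, W, C = 4, 6, 8
visual = rng.standard_normal((H * W, C))
depth = rng.standard_normal((H * W, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

fused, attn = depth_cross_attention(visual, depth, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Each row of `attn` is a distribution over depth tokens, so every visual token receives a weighted mix of depth features. A pyramid variant would presumably repeat this at several feature resolutions, but that detail is not specified in the text above.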
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
