MonoPGC: Monocular 3D Object Detection with Pixel Geometry Contexts

Zizhang Wu; Yuanzhu Gan; Lei Wang; Guilian Chen; Jian Pu

arXiv:2302.10549 · cs.CV · February 22, 2023

Open Access

TL;DR

MonoPGC introduces a novel monocular 3D object detection framework that leverages pixel geometry contexts, depth estimation, and advanced attention mechanisms to improve accuracy and efficiency in autonomous driving scenarios.

Contribution

The paper proposes a new end-to-end framework with depth cross-attention, depth-space-aware transformer, and depth-gradient positional encoding for enhanced monocular 3D detection.

Findings

01 Achieves state-of-the-art performance on the KITTI dataset.

02 Effectively integrates pixel geometry with depth information.

03 Improves detection accuracy with novel attention modules.

Abstract

Monocular 3D object detection is an economical but challenging task in autonomous driving. Recently, center-based monocular methods have developed rapidly, offering a good trade-off between speed and accuracy; they usually rely on estimating the depth of the object center from 2D features. However, visual semantic features that lack sufficient pixel geometry information may provide weak cues for spatial 3D detection. To alleviate this, we propose MonoPGC, a novel end-to-end Monocular 3D object detection framework with rich Pixel Geometry Contexts. We introduce pixel depth estimation as an auxiliary task and design a depth cross-attention pyramid module (DCPM) to inject local and global depth geometry knowledge into visual features. In addition, we present the depth-space-aware transformer (DSAT) to integrate 3D space position and depth-aware features…
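
The idea of injecting depth knowledge into visual features through cross-attention can be illustrated with a minimal NumPy sketch. This is not the paper's DCPM: it shows only a generic single-head scaled dot-product cross-attention, and it assumes that visual features act as queries while depth features supply keys and values. The function name `depth_cross_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def depth_cross_attention(visual, depth, w_q, w_k, w_v):
    """Generic single-head cross-attention (illustrative, not the paper's DCPM).

    visual: (N, C) flattened visual features, used as queries (assumption).
    depth:  (M, C) flattened depth features, used as keys/values (assumption).
    Returns the fused features (N, C) and the attention weights (N, M).
    """
    q = visual @ w_q                      # (N, C) queries from visual features
    k = depth @ w_k                       # (M, C) keys from depth features
    v = depth @ w_v                       # (M, C) values from depth features
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)       # each query row sums to 1
    fused = visual + attn @ v             # residual connection (assumption)
    return fused, attn


# Toy inputs: 6 pixel positions with 8 channels each (arbitrary sizes).
rng = np.random.default_rng(0)
C = 8
visual = rng.normal(size=(6, C))
depth = rng.normal(size=(6, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))

fused, attn = depth_cross_attention(visual, depth, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

In this sketch each visual position attends over all depth positions, which corresponds loosely to the "global" depth context mentioned in the abstract; restricting the keys to a local window would be one way to model a "local" variant, though the abstract does not say how the paper does this.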

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Neural Network Applications · Robotics and Sensor-Based Localization

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings