MonoMAE: Enhancing Monocular 3D Detection through Depth-Aware Masked Autoencoders
Xueying Jiang, Sheng Jin, Xiaoqin Zhang, Ling Shao, Shijian Lu

TL;DR
MonoMAE introduces a depth-aware masked autoencoder approach for monocular 3D detection, effectively handling occlusions by masking and reconstructing object features, leading to improved accuracy and domain generalization.
Contribution
It proposes two designs, depth-aware masking and lightweight query completion, to improve 3D object detection from monocular images, especially under occlusion.
Findings
Achieves superior detection performance on occluded and non-occluded objects.
Learns representations that generalize well across different domains.
Improves 3D localization and identification accuracy.
Abstract
Monocular 3D object detection aims for precise 3D localization and identification of objects from a single-view image. Despite its recent progress, it often struggles while handling pervasive object occlusions that tend to complicate and degrade the prediction of object dimensions, depths, and orientations. We design MonoMAE, a monocular 3D detector inspired by Masked Autoencoders that addresses the object occlusion issue by masking and reconstructing objects in the feature space. MonoMAE consists of two novel designs. The first is depth-aware masking that selectively masks certain parts of non-occluded object queries in the feature space for simulating occluded object queries for network training. It masks non-occluded object queries by balancing the masked and preserved query portions adaptively according to the depth information. The second is lightweight query completion that works…
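As a rough illustration of the depth-aware masking idea described in the abstract, the sketch below masks a depth-dependent fraction of each non-occluded object query. It is not the authors' implementation. The function name `depth_aware_mask`, the `max_depth` parameter, the linear rule that maps depth to a mask ratio (nearer objects masked more heavily), and zero-filling of masked entries are all assumptions made for the example; the abstract only states that the masked and preserved portions are balanced adaptively according to depth.

```python
import random


def depth_aware_mask(queries, depths, max_depth=60.0, seed=None):
    """Mask a depth-dependent share of each object query (hypothetical sketch).

    queries:   list of feature vectors (lists of floats), one per object query
    depths:    list of object depths, same length as queries
    max_depth: assumed normalisation constant, not taken from the paper
    """
    rng = random.Random(seed)
    masked_queries, masks = [], []
    for query, depth in zip(queries, depths):
        # Assumed rule: the mask ratio falls linearly from 1 at depth 0
        # to 0 at max_depth. The paper's actual rule may differ.
        ratio = 1.0 - min(max(depth / max_depth, 0.0), 1.0)
        num_masked = int(ratio * len(query))
        # Pick which feature positions to mask, uniformly at random.
        masked_idx = set(rng.sample(range(len(query)), num_masked))
        masks.append([i in masked_idx for i in range(len(query))])
        # Masked positions are zeroed; preserved positions are copied.
        masked_queries.append(
            [0.0 if i in masked_idx else v for i, v in enumerate(query)]
        )
    return masked_queries, masks
```

In a training loop, the masked queries would stand in for occluded objects, and a completion module would be trained to reconstruct the original queries from them.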
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Advanced Neural Network Applications · 3D Surveying and Cultural Heritage
