S$^3$-MonoDETR: Supervised Shape&Scale-perceptive Deformable Transformer for Monocular 3D Object Detection
Xuan He, Jin Yuan, Kailun Yang, Zhenchao Zeng, Zhiyong Li

TL;DR
This paper introduces S$^3$-MonoDETR, a monocular 3D object detector built around a supervised shape&scale-perceptive deformable attention module, which improves detection accuracy across multiple object categories.
Contribution
The paper proposes a new S$^3$-DA module with shape&scale perception, together with an MSM loss, to make query features more robust and improve monocular 3D detection performance.
Findings
Achieves state-of-the-art results on KITTI and Waymo datasets.
Effectively detects multiple object categories with a single training process.
Significantly improves detection accuracy over existing methods.
Abstract
Recently, transformer-based methods have shown exceptional performance in monocular 3D object detection, which can predict 3D attributes from a single 2D image. These methods typically use visual and depth representations to generate query points on objects, whose quality plays a decisive role in the detection accuracy. However, current unsupervised attention mechanisms in transformers, which lack any awareness of geometric appearance, are susceptible to producing noisy features for query points. This severely limits network performance and leaves the model poorly equipped to detect multi-category objects in a single training process. To tackle this problem, this paper proposes a novel ``Supervised Shape&Scale-perceptive Deformable Attention'' (S$^3$-DA) module for monocular 3D object detection. Concretely, S$^3$-DA utilizes visual and depth features to generate diverse local…
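As a rough illustration of the unsupervised deformable attention that the abstract contrasts against, the sketch below aggregates features for one query point by sampling a feature map at a few offset locations and combining the samples with softmax weights. It is a generic, minimal version and not the paper's S$^3$-DA module: the function names `bilinear_sample` and `deformable_attention` are hypothetical, the offsets and attention logits are passed in directly (in a real network they would be predicted from the query), and the shape&scale supervision described in the paper is not modeled.

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Bilinearly sample feat of shape (H, W, C) at continuous pixel location (x, y)."""
    h, w, _ = feat.shape
    # Clamp the location so that out-of-range offsets stay inside the map.
    x = float(np.clip(x, 0, w - 1))
    y = float(np.clip(y, 0, h - 1))
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x1]
    bottom = (1 - dx) * feat[y1, x0] + dx * feat[y1, x1]
    return (1 - dy) * top + dy * bottom

def deformable_attention(feat, ref_xy, offsets, logits):
    """Aggregate features around one query point (generic sketch, not S^3-DA).

    feat:    (H, W, C) feature map
    ref_xy:  (2,) reference location (x, y) of the query point
    offsets: (K, 2) sampling offsets (assumed given; normally predicted from the query)
    logits:  (K,) unnormalized attention scores (assumed given)
    """
    # Numerically stable softmax over the K sampling points.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    samples = np.stack([
        bilinear_sample(feat, ref_xy[0] + ox, ref_xy[1] + oy)
        for ox, oy in offsets
    ])  # (K, C)
    return weights @ samples  # (C,)

# Toy feature map whose single channel equals the x coordinate.
feat = np.tile(np.arange(5, dtype=float)[None, :, None], (4, 1, 1))  # (4, 5, 1)
out = deformable_attention(
    feat,
    ref_xy=np.array([2.0, 1.0]),
    offsets=np.array([[-0.5, 0.0], [0.5, 0.0]]),
    logits=np.zeros(2),
)
```

Because nothing in this plain form constrains where the offsets land, samples can fall on background or on neighbouring objects, which is one way to read the "noisy features for query points" that the abstract attributes to unsupervised attention.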
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
