LAM3D: Leveraging Attention for Monocular 3D Object Detection

Diana-Alexandra Sas, Leandro Di Bella, Yangxintong Lyu, Florin Oniga, and Adrian Munteanu

arXiv:2408.01739 · cs.CV · August 6, 2024

Open Access

TL;DR

LAM3D is a framework that leverages self-attention within Vision Transformers for monocular 3D object detection. It improves accuracy on the KITTI benchmark, supporting its use in autonomous driving applications.

Contribution

It integrates self-attention into monocular 3D object detection and outperforms non-attention-based architectures on the KITTI 3D Object Detection Benchmark.

Findings

01

LAM3D outperforms reference methods on the KITTI benchmark.

02

Self-attention systematically improves detection accuracy.

03

The framework is effective for autonomous driving scenarios.

Abstract

Since the introduction of the self-attention mechanism and the adoption of the Transformer architecture for Computer Vision tasks, Vision Transformer-based architectures have gained considerable popularity in the field and are used for tasks such as image classification, object detection and image segmentation. However, how to efficiently leverage the attention mechanism in vision transformers for the Monocular 3D Object Detection task remains an open question. In this paper, we present LAM3D, a framework that Leverages the self-Attention mechanism for Monocular 3D object Detection. To do so, the proposed method is built upon a Pyramid Vision Transformer v2 (PVTv2) as the feature extraction backbone, combined with 2D/3D detection machinery. We evaluate the proposed method on the KITTI 3D Object Detection Benchmark, demonstrating the applicability of the proposed solution in the autonomous driving domain and outperforming…
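The abstract names PVTv2 as the backbone but does not spell out how its attention works. As a hedged illustration only, the NumPy sketch below implements single-head spatial-reduction attention in the style introduced by PVT and carried into PVTv2: every token produces a query, while keys and values come from a spatially downsampled copy of the feature map, so the attention matrix shrinks from N×N to N×(N/R²). This is not LAM3D's code. The function name `spatial_reduction_attention`, the random weights, the single head, and the use of average pooling in place of PVT's strided convolution are all assumptions made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_reduction_attention(x, h, w, wq, wk, wv, r):
    """Hypothetical single-head spatial-reduction attention sketch.

    x:          (h*w, c) token sequence flattened from an h x w feature map.
    wq, wk, wv: (c, d) projection matrices (random here, for illustration).
    r:          spatial reduction ratio; keys/values use an (h/r) x (w/r) map.
    """
    n, c = x.shape
    assert n == h * w and h % r == 0 and w % r == 0
    q = x @ wq  # (n, d): one query per full-resolution token

    # Downsample the key/value source by r in each spatial dimension.
    # Assumption: r x r average pooling stands in for a strided convolution.
    fmap = x.reshape(h, w, c)
    reduced = fmap.reshape(h // r, r, w // r, r, c).mean(axis=(1, 3))
    reduced = reduced.reshape(-1, c)  # (n / r^2, c)

    k = reduced @ wk  # (n / r^2, d)
    v = reduced @ wv  # (n / r^2, d)

    # Scaled dot-product attention over the reduced set of keys.
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (n, n / r^2)
    return attn @ v, attn


# Usage: an 8 x 16 feature map with 32 channels and reduction ratio 4.
rng = np.random.default_rng(0)
h, w, c, d, r = 8, 16, 32, 32, 4
x = rng.standard_normal((h * w, c))
wq, wk, wv = (rng.standard_normal((c, d)) / np.sqrt(c) for _ in range(3))
out, attn = spatial_reduction_attention(x, h, w, wq, wk, wv, r)
print(out.shape, attn.shape)  # (128, 32) (128, 8)
```

With R = 4, each of the 128 queries attends to only 8 pooled keys instead of all 128 tokens. This reduction is what makes attention affordable on the high-resolution early stages of a pyramid backbone.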
