FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion
Xinhao Xiang, Jiawei Zhang

TL;DR
FusionViT is a hierarchical vision transformer model that fuses camera and LiDAR data for 3D object detection, achieving state-of-the-art results on the KITTI and Waymo datasets.
Contribution
Introduces a pure-ViT hierarchical framework that embeds and fuses multi-modal (camera and LiDAR) data for 3D object detection; the authors report that it outperforms existing methods.
Findings
Achieves state-of-the-art detection performance on the KITTI and Waymo datasets.
Outperforms existing single-modal and multi-modal fusion approaches.
Demonstrates the effectiveness of a hierarchical vision transformer architecture.
Abstract
For 3D object detection, both camera and lidar have been demonstrated to be useful sensory devices for providing complementary information about the same scenery with data representations in different modalities, e.g., 2D RGB image vs 3D point cloud. An effective representation learning and fusion of such multi-modal sensor data is necessary and critical for better 3D object detection performance. To solve the problem, in this paper, we will introduce a novel vision transformer-based 3D object detection model, namely FusionViT. Different from the existing 3D object detection approaches, FusionViT is a pure-ViT based framework, which adopts a hierarchical architecture by extending the transformer model to embed both images and point clouds for effective representation learning. Such multi-modal data embedding representations will be further fused together via a fusion vision transformer…
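The abstract is truncated before it describes how the fusion transformer works, so the following is only a minimal sketch of one common way to fuse two token sequences with attention, not the authors' implementation. It assumes camera and LiDAR embeddings have already been produced as token vectors of the same width, concatenates them into one sequence, and applies single-head scaled dot-product self-attention with a residual connection. All names (`fuse`, `self_attention`, `random_matrix`), the token counts, the embedding width, and the random weights are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


def fuse(camera_tokens, lidar_tokens, w_q, w_k, w_v):
    """Assumed fusion scheme: concatenate both modalities along the
    sequence axis, attend over the joint sequence, add a residual."""
    tokens = camera_tokens + lidar_tokens
    attended, weights = self_attention(tokens, w_q, w_k, w_v)
    fused = [[t + a for t, a in zip(tok, att)] for tok, att in zip(tokens, attended)]
    return fused, weights


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 3 camera tokens, 2 LiDAR tokens, embedding width 4.
rng = random.Random(0)
dim = 4
camera_tokens = random_matrix(3, dim, rng)
lidar_tokens = random_matrix(2, dim, rng)
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

fused, weights = fuse(camera_tokens, lidar_tokens, w_q, w_k, w_v)
print(len(fused), "fused tokens of width", len(fused[0]))
```

Because every token attends over the joint sequence, each camera token can draw on LiDAR tokens and vice versa. A practical model would add multi-head attention, layer normalization, and feed-forward layers, and would stack several such blocks.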
Taxonomy
Topics: Advanced Neural Network Applications · Infrastructure Maintenance and Monitoring · Advanced Optical Sensing Technologies
Methods: Attention Is All You Need · Softmax · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Layer Normalization · Vision Transformer
