CLFT: Camera-LiDAR Fusion Transformer for Semantic Segmentation in Autonomous Driving
Junyi Gu, Mauro Bellone, Tomáš Pivoňka, and Raivo Sell

TL;DR
This paper introduces CLFT, a vision-transformer-based camera-LiDAR fusion network for semantic segmentation in autonomous driving, demonstrating robustness and improved performance in challenging weather conditions.
Contribution
The paper presents a novel progressive-assemble and cross-fusion strategy for vision transformers in multimodal sensor fusion for autonomous driving.
Findings
Up to 10% improvement in dark-wet conditions over FCN-based fusion networks.
5-10% overall improvement compared to single-modality transformer backbones.
Robust performance in rain and low illumination conditions.
Abstract
Critical research about camera-and-LiDAR-based semantic object segmentation for autonomous driving significantly benefited from the recent development of deep learning. Specifically, the vision transformer is the novel ground-breaker that successfully brought the multi-head-attention mechanism to computer vision applications. Therefore, we propose a vision-transformer-based network to carry out camera-LiDAR fusion for semantic segmentation applied to autonomous driving. Our proposal uses the novel progressive-assemble strategy of vision transformers on a double-direction network and then integrates the results in a cross-fusion strategy over the transformer decoder layers. Unlike other works in the literature, our camera-LiDAR fusion transformers have been evaluated in challenging conditions like rain and low illumination, showing robust performance. The paper reports the segmentation…
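For readers unfamiliar with the multi-head-attention mechanism the abstract refers to, the following is a minimal sketch of single-head scaled dot-product attention, the building block that multi-head attention runs several times in parallel. It is not taken from the CLFT code base: the function names `softmax` and `attention` and the toy token values are assumptions made for illustration, and the sketch says nothing about how CLFT itself assembles or fuses camera and LiDAR features.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of token vectors.

    Illustrative helper (hypothetical, not the paper's implementation).
    Each query is compared with every key, the scaled scores are turned
    into weights with softmax, and the output is the weighted sum of values.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy example: one query, two keys with identical scores, so the
# attention weights are uniform and the output is the mean of the values.
toy_queries = [[1.0, 0.0]]
toy_keys = [[0.0, 0.0], [0.0, 0.0]]
toy_values = [[1.0, 2.0], [3.0, 4.0]]
toy_output = attention(toy_queries, toy_keys, toy_values)
```

In a vision transformer the token vectors would come from image patches (or, in a fusion setting, from projected LiDAR data), and learned linear layers would produce the queries, keys, and values; those parts are omitted here to keep the sketch dependency-free.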
Taxonomy
Topics: Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Layer Normalization · Multi-Head Attention · Residual Connection · Softmax · Vision Transformer
