CoBEVT: Cooperative Bird's Eye View Semantic Segmentation with Sparse Transformers
Runsheng Xu, Zhengzhong Tu, Hao Xiang, Wei Shao, Bolei Zhou, Jiaqi Ma

TL;DR
CoBEVT is a cooperative multi-agent framework that uses sparse Transformers for bird's eye view (BEV) semantic segmentation, improving perception accuracy and range over single-agent systems in autonomous driving scenarios.
Contribution
CoBEVT is the first generic multi-agent, multi-camera perception framework for cooperative BEV map prediction, built around a new fused axial attention (FAX) module.
Findings
Achieves state-of-the-art performance on OPV2V, a V2V perception dataset.
Demonstrates generalizability to single-agent BEV segmentation and multi-agent 3D detection.
Operates with real-time inference speed.
Abstract
Bird's eye view (BEV) semantic segmentation plays a crucial role in spatial sensing for autonomous driving. Although recent work has made significant progress on BEV map understanding, it is all based on single-agent camera systems. These solutions sometimes have difficulty handling occlusions or detecting distant objects in complex traffic scenes. Vehicle-to-Vehicle (V2V) communication technologies have enabled autonomous vehicles to share sensing information, dramatically improving the perception performance and range compared to single-agent systems. In this paper, we propose CoBEVT, the first generic multi-agent multi-camera perception framework that can cooperatively generate BEV map predictions. To efficiently fuse camera features from multi-view and multi-agent data in an underlying Transformer architecture, we design a fused axial attention module (FAX), which…
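To make the idea of fusing features from several agents with attention concrete, here is a minimal sketch in plain Python. It is not the FAX module, whose description is cut off in the abstract above. It is ordinary scaled dot-product attention applied at each BEV cell, with the ego vehicle's feature as the query and every agent's feature as keys and values. The names fuse_cell and fuse_bev, the absence of learned projections, and the assumption that all agents' maps are already warped into the ego frame are illustrative choices, not details taken from the paper.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_cell(ego_feat, agent_feats):
    """Fuse one BEV cell (hypothetical helper, not the paper's FAX).

    ego_feat: feature vector of the ego vehicle at this cell (the query).
    agent_feats: one feature vector per agent at this cell (keys and values).
    Returns the fused vector and the attention weights over agents.
    """
    d = len(ego_feat)
    scores = [
        sum(q * k for q, k in zip(ego_feat, feat)) / math.sqrt(d)
        for feat in agent_feats
    ]
    weights = softmax(scores)
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, agent_feats))
        for i in range(d)
    ]
    return fused, weights


def fuse_bev(agent_maps, ego_index=0):
    """Apply fuse_cell at every cell of a BEV grid.

    agent_maps[a][y][x] is the feature vector of agent a at cell (y, x).
    Assumption: every map is already aligned to the ego vehicle's frame.
    """
    ego = agent_maps[ego_index]
    height, width = len(ego), len(ego[0])
    return [
        [
            fuse_cell(ego[y][x], [m[y][x] for m in agent_maps])[0]
            for x in range(width)
        ]
        for y in range(height)
    ]
```

Attending over all agents at every cell is the dense baseline; the abstract's emphasis on sparse Transformers and efficiency suggests the actual module restricts which positions attend to each other, which this sketch does not attempt.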
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Robotics and Sensor-Based Localization
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing
