ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation
Pramit Dutta, Ganesh Sistu, Senthil Yogamani, Edgar Galván, John McDonald

TL;DR
This paper introduces ViT-BEVSeg, a hierarchical transformer-based network that generates detailed Bird's-Eye-View (BEV) segmentation maps from monocular images and reports improved performance over CNN-based methods in autonomous vehicle perception tasks.
Contribution
The paper proposes vision transformers as the backbone for BEV map generation in place of traditional CNNs and demonstrates the approach on the nuScenes dataset.
Findings
Significant performance improvement over CNN-based methods
Effective multi-scale feature representation with vision transformers
Enhanced accuracy in BEV segmentation for autonomous driving
Abstract
Generating a detailed near-field perceptual model of the environment is an important and challenging problem in both self-driving vehicles and autonomous mobile robotics. A Bird's-Eye-View (BEV) map, providing a panoptic representation, is a commonly used approach that provides a simplified 2D representation of the vehicle surroundings with accurate semantic-level segmentation for many downstream tasks. Current state-of-the-art approaches to generating BEV maps employ a Convolutional Neural Network (CNN) backbone to create feature maps, which are passed through a spatial transformer to project the derived features onto the BEV coordinate frame. In this paper, we evaluate the use of vision transformers (ViT) as a backbone architecture to generate BEV maps. Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image.…
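To make the idea of a multi-scale representation from a ViT backbone concrete, the shape arithmetic can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the 16-pixel patch size, the strides (4, 8, 16, 32) and the 224×480 input are all assumptions chosen for the example, and the resampling step is only described by its output size.

```python
from typing import List, Sequence, Tuple


def patch_grid(height: int, width: int, patch: int) -> Tuple[int, int]:
    """Token grid (rows, cols) for non-overlapping square patches of a ViT.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return height // patch, width // patch


def multiscale_shapes(
    height: int,
    width: int,
    patch: int = 16,                            # assumed patch size
    strides: Sequence[int] = (4, 8, 16, 32),    # assumed pyramid strides
) -> List[Tuple[int, int]]:
    """Spatial size of the feature map at each stride.

    The token grid is treated as a 2D map at stride `patch`; each output
    level is that map resampled to the requested stride.
    """
    rows, cols = patch_grid(height, width, patch)
    shapes = []
    for stride in strides:
        # rows * patch recovers the image height, so this is height // stride.
        shapes.append((rows * patch // stride, cols * patch // stride))
    return shapes


if __name__ == "__main__":
    print(patch_grid(224, 480, 16))
    print(multiscale_shapes(224, 480))
```

Under these assumptions, a single-resolution token grid yields a pyramid of feature maps comparable in layout to what a CNN backbone would hand to the projection stage that maps features onto the BEV frame.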
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Neural Network Applications · Video Surveillance and Tracking Methods
Methods: Spatial Transformer
