ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation

Pramit Dutta, Ganesh Sistu, Senthil Yogamani, Edgar Galván, John McDonald

arXiv:2205.15667·cs.CV·June 1, 2022·1 cites

TL;DR

This paper introduces ViT-BEVSeg, a hierarchical transformer-based network that generates detailed Bird Eye View (BEV) segmentation maps from monocular images, and reports improved performance over CNN-based methods on autonomous-vehicle perception tasks.

Contribution

The paper proposes vision transformers as the backbone for BEV map generation, in place of the usual CNN backbone, and demonstrates the approach on the nuScenes dataset.

Findings

01. Significant performance improvement over CNN-based methods

02. Effective multi-scale feature representation with vision transformers

03. Enhanced accuracy in BEV segmentation for autonomous driving

Abstract

Generating a detailed near-field perceptual model of the environment is an important and challenging problem in both self-driving vehicles and autonomous mobile robotics. A Bird Eye View (BEV) map, providing a panoptic representation, is a commonly used approach that provides a simplified 2D representation of the vehicle surroundings with accurate semantic-level segmentation for many downstream tasks. Current state-of-the-art approaches to generate BEV maps employ a Convolutional Neural Network (CNN) backbone to create feature maps, which are passed through a spatial transformer to project the derived features onto the BEV coordinate frame. In this paper, we evaluate the use of vision transformers (ViT) as a backbone architecture to generate BEV maps. Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image.…
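To make the idea of a multi-scale representation from a standard ViT more concrete, the sketch below only does shape arithmetic: it computes the token grid a ViT produces for an image, and the feature-map sizes obtained if that grid is resampled to several strides. The patch size of 16, the strides (4, 8, 16, 32), the 224x448 input, and the resampling scheme itself are illustrative assumptions, not details taken from the abstract; the function names are hypothetical.

```python
from typing import List, Tuple


def patch_grid(height: int, width: int, patch: int) -> Tuple[int, int]:
    """Return the (rows, cols) token grid for non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return height // patch, width // patch


def multiscale_shapes(
    height: int,
    width: int,
    patch: int = 16,                                  # assumed patch size
    strides: Tuple[int, ...] = (4, 8, 16, 32),        # assumed pyramid strides
) -> List[Tuple[int, int, int, float]]:
    """List (stride, rows, cols, resample_factor) for each pyramid level.

    resample_factor is relative to the ViT token grid:
    > 1 means the grid is upsampled, < 1 means it is downsampled.
    """
    rows, cols = patch_grid(height, width, patch)
    shapes = []
    for stride in strides:
        factor = patch / stride
        shapes.append((stride, int(rows * factor), int(cols * factor), factor))
    return shapes


if __name__ == "__main__":
    # Hypothetical input resolution, chosen only for round numbers.
    rows, cols = patch_grid(224, 448, 16)
    print(f"token grid: {rows} x {cols} = {rows * cols} tokens")
    for stride, r, c, factor in multiscale_shapes(224, 448):
        print(f"stride {stride:>2}: {r} x {c} (resample x{factor})")
```

Under these assumptions a single token grid yields a four-level pyramid, which is the kind of multi-scale input a downstream spatial transformer could project onto the BEV frame. The actual ViT-BEVSeg design may differ.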

Code & Models

Repositories

robotvisionmu/vit-bevseg (Official)

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Neural Network Applications · Video Surveillance and Tracking Methods

Methods: Spatial Transformer