Fast-BEV: A Fast and Strong Bird's-Eye View Perception Baseline
Yangguang Li, Bin Huang, Zeren Chen, Yufeng Cui, Feng Liang, Mingzhu, Shen, Fenggang Liu, Enze Xie, Lu Sheng, Wanli Ouyang, Jing Shao

TL;DR
Fast-BEV introduces a lightweight, efficient BEV perception framework for autonomous vehicles that achieves high speed and competitive accuracy without relying on complex transformer-based transformations.
Contribution
The paper proposes a novel, resource-efficient BEV perception framework that eliminates the need for expensive transformations, enabling faster inference on on-vehicle chips.
Findings
R50 model runs at 52.6 FPS with 47.3% NDS on nuScenes.
Outperforms existing BEV methods in speed and accuracy.
Largest model achieves 53.5% NDS, demonstrating strong performance.
Abstract
Recently, perception task based on Bird's-Eye View (BEV) representation has drawn more and more attention, and BEV representation is promising as the foundation for next-generation Autonomous Vehicle (AV) perception. However, most existing BEV solutions either require considerable resources to execute on-vehicle inference or suffer from modest performance. This paper proposes a simple yet effective framework, termed Fast-BEV , which is capable of performing faster BEV perception on the on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without expensive transformer based transformation nor depth representation. Our Fast-BEV consists of five parts, We novelly propose (1) a lightweight deployment-friendly view transformation which fast transfers 2D image feature to 3D voxel space, (2) an multi-scale image encoder which…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
