Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer
Shaoyu Chen, Tianheng Cheng, Xinggang Wang, Wenming Meng, Qian Zhang, Wenyu Liu

TL;DR
This paper introduces a Geometry-guided Kernel Transformer (GKT) for efficient, robust 2D-to-BEV representation learning from surround-view cameras, achieving real-time performance and state-of-the-art segmentation accuracy for autonomous driving.
Contribution
The paper proposes a novel GKT mechanism that uses geometric priors to guide attention and a look-up table (LUT) indexing method, enabling fast, robust, and accurate BEV perception without needing the cameras' calibrated parameters at runtime.
Findings
GKT runs at 72.3 FPS on 3090 GPU and 45.6 FPS on 2080ti GPU.
Achieves 38.0 mIoU on nuScenes validation set.
Demonstrates robustness to camera deviations and predefined BEV height.
Abstract
Learning Bird's Eye View (BEV) representation from surrounding-view cameras is of great importance for autonomous driving. In this work, we propose a Geometry-guided Kernel Transformer (GKT), a novel 2D-to-BEV representation learning mechanism. GKT leverages the geometric priors to guide the transformer to focus on discriminative regions and unfolds kernel features to generate BEV representation. For fast inference, we further introduce a look-up table (LUT) indexing method to get rid of the camera's calibrated parameters at runtime. GKT can run at 72.3 FPS on 3090 GPU / 45.6 FPS on 2080ti GPU and is robust to the camera deviation and the predefined BEV height. And GKT achieves the state-of-the-art real-time segmentation results, i.e., 38.0 mIoU (100m × 100m perception range at a 0.5m resolution) on the nuScenes val set. Given the efficiency, effectiveness, and robustness, GKT…
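The LUT idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `build_lut` and `gather_kernel_features`, the pinhole camera model, the world-to-camera convention `p_cam = R @ p + t`, and the toy values are all assumptions made for illustration, and the transformer attention over the gathered kernel features is omitted. The sketch projects each BEV cell (at a fixed, predefined height) into one camera's feature map once, stores the flat indices of a small kernel window around the projected location, and then reuses that table at runtime so that no calibration parameters are needed during inference.

```python
import numpy as np

def build_lut(bev_xyz, K, R, t, feat_hw, kernel=3):
    """Offline step (assumed pinhole model): for each BEV point, store the flat
    indices of a kernel x kernel window around its projection in the feature map."""
    H, W = feat_hw
    cam = bev_xyz @ R.T + t                       # (N, 3) points in the camera frame
    depth = cam[:, 2]
    in_front = depth > 1e-6                       # points behind the camera are invalid
    safe_depth = np.where(in_front, depth, 1.0)   # avoid dividing by zero or negatives
    uvw = cam @ K.T                               # apply intrinsics
    u = np.rint(uvw[:, 0] / safe_depth).astype(np.int64)
    v = np.rint(uvw[:, 1] / safe_depth).astype(np.int64)
    r = kernel // 2
    dv, du = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    us = u[:, None] + du.ravel()[None, :]         # (N, kernel*kernel) column indices
    vs = v[:, None] + dv.ravel()[None, :]         # (N, kernel*kernel) row indices
    valid = in_front[:, None] & (us >= 0) & (us < W) & (vs >= 0) & (vs < H)
    idx = np.where(valid, vs * W + us, 0)         # invalid entries point at index 0
    return idx, valid

def gather_kernel_features(feat, idx, valid):
    """Runtime step: pure table lookup, no camera parameters involved."""
    C = feat.shape[0]
    flat = feat.reshape(C, -1)                    # (C, H*W)
    out = flat[:, idx]                            # (C, N, kernel*kernel)
    out = np.where(valid[None], out, 0.0)         # zero out invalid window entries
    return out.transpose(1, 2, 0)                 # (N, kernel*kernel, C)

# Toy usage with made-up calibration values (hypothetical, not from the paper).
K = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
points = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0], [2.0, 2.0, 1.0]])
lut_idx, lut_valid = build_lut(points, K, R, t, feat_hw=(5, 5), kernel=3)
features = np.arange(25, dtype=float).reshape(1, 5, 5)
kernel_feats = gather_kernel_features(features, lut_idx, lut_valid)
```

In this sketch, `build_lut` would be run once per camera when the rig is calibrated, and only `gather_kernel_features` runs per frame. Because each BEV cell reads a window rather than a single pixel, a small shift in the projected location still lands inside the gathered region, which is one plausible reading of why a kernel-based design tolerates camera deviation.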
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Label Smoothing · Softmax · Byte Pair Encoding · Adam · Dropout · Residual Connection
