An Initial Study of Bird's-Eye View Generation for Autonomous Vehicles using Cross-View Transformers
Felipe Carlos dos Santos, Eric Aislan Antonelo, Gustavo Claudio Karl Couto

TL;DR
This paper explores using Cross-View Transformers to generate Bird's-Eye View maps from camera images for autonomous driving, focusing on generalization and robustness in urban environments.
Contribution
It introduces a novel application of Cross-View Transformers for BEV map generation and evaluates their performance across different camera layouts and unseen towns.
Findings
A four-camera CVT trained with L1 loss performs best when tested in a new town.
CVT demonstrates promising generalization to unseen urban areas.
L1 loss yields more robust BEV maps than focal loss.
Abstract
Bird's-Eye View (BEV) maps provide a structured, top-down abstraction that is crucial for autonomous-driving perception. In this work, we employ Cross-View Transformers (CVT) to learn a mapping from camera images to three BEV channels (road, lane markings, and planned trajectory) using a realistic simulator for urban driving. Our study examines generalization to unseen towns, the effect of different camera layouts, and two loss formulations (focal and L1). Using training data from only one town, a four-camera CVT trained with the L1 loss delivers the most robust test performance when evaluated in a new town. Overall, our results underscore CVT's promise for mapping camera inputs to reasonably accurate BEV maps.
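To make the two loss formulations concrete, here is a minimal sketch of an L1 loss and a binary focal loss over per-pixel BEV values. It is an illustration only, not the authors' implementation: the function names, the flattened-list representation, and the `gamma` and `alpha` defaults (taken from the commonly used focal-loss formulation) are all assumptions, since the abstract does not specify them.

```python
import math

# Channel names follow the abstract; treating each channel as a flat list of
# per-pixel values in [0, 1] is an assumption made for this sketch.
CHANNELS = ("road", "lane_markings", "planned_trajectory")


def l1_loss(pred, target):
    """Mean absolute error between predicted and target pixel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def focal_loss(pred, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss averaged over pixels.

    gamma and alpha are assumed defaults, not values reported in the paper.
    """
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)       # avoid log(0)
        p_t = p if t == 1 else 1.0 - p        # probability given to the true class
        a_t = alpha if t == 1 else 1.0 - alpha
        # (1 - p_t) ** gamma down-weights pixels that are already well classified
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(pred)


# Toy example: one flattened 2x2 BEV channel (hypothetical values).
target = [1, 0, 0, 1]
confident = [0.9, 0.1, 0.1, 0.9]
uncertain = [0.5, 0.5, 0.5, 0.5]

for name, pred in (("confident", confident), ("uncertain", uncertain)):
    print(name, l1_loss(pred, target), focal_loss(pred, target))
```

The L1 loss penalizes every pixel in proportion to its error, while the focal loss concentrates on poorly classified pixels. This sketch shows only how the two losses are defined, not why one generalized better in the study.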
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Human-Automation Interaction and Safety
