LetsMap: Unsupervised Representation Learning for Semantic BEV Mapping
Nikhil Gosala, Kürsat Petek, B Ravi Kiran, Senthil Yogamani, Paulo Drews-Jr, Wolfram Burgard, Abhinav Valada

TL;DR
This paper introduces the first unsupervised method for semantic BEV mapping from monocular images, reducing the need for extensive labeled data by leveraging spatial-temporal consistency and a novel autoencoder.
Contribution
It proposes an unsupervised pretraining approach that independently reasons about scene geometry and semantics, enabling label-efficient semantic BEV map generation.
Findings
Performs on par with existing state-of-the-art approaches while using only 1% of BEV labels.
Uses spatial-temporal consistency for label-free pretraining.
No additional labeled data required for effective BEV mapping.
Abstract
Semantic Bird's Eye View (BEV) maps offer a rich representation with strong occlusion reasoning for various decision making tasks in autonomous driving. However, most BEV mapping approaches employ a fully supervised learning paradigm that relies on large amounts of human-annotated BEV ground truth data. In this work, we address this limitation by proposing the first unsupervised representation learning approach to generate semantic BEV maps from a monocular frontal view (FV) image in a label-efficient manner. Our approach pretrains the network to independently reason about scene geometry and scene semantics using two disjoint neural pathways in an unsupervised manner and then finetunes it for the task of semantic BEV mapping using only a small fraction of labels in the BEV. We achieve label-free pretraining by exploiting spatial and temporal consistency of FV images to learn scene…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
