Monocular Semantic Occupancy Grid Mapping with Convolutional Variational Encoder-Decoder Networks
Chenyang Lu, Marinus Jacobus Gerardus van de Molengraft, Gijs Dubbelman

TL;DR
This paper presents a variational encoder-decoder network for monocular semantic occupancy grid mapping that outperforms a deterministic flat-plane mapping baseline, is robust to vehicle dynamic perturbations, and runs in real time.
Contribution
It introduces an end-to-end deep learning approach that uses a variational encoder-decoder to map a monocular front-view image to a top-view semantic occupancy grid, improving accuracy over a flat-plane baseline and robustness to vehicle dynamic perturbations.
Findings
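Outperforms the deterministic mapping approach based on a flat-plane assumption by more than 12% mean IoU on Cityscapes.
Variational sampling with a relatively small embedding vector provides robustness against vehicle dynamic perturbations and generalizes to unseen KITTI data.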
Achieves real-time inference at approximately 35 Hz.
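The mean IoU figure above is the class-averaged intersection-over-union between a predicted grid and a reference grid. The following is a minimal sketch of that metric, assuming integer class labels per grid cell; the function name `mean_iou`, the `ignore_label` parameter, and the toy grids are illustrative assumptions, and the paper's exact evaluation protocol (for example, how unobserved cells are treated) is not given in this summary.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_label=None):
    """Class-averaged IoU between two integer label grids (hypothetical helper)."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    if ignore_label is not None:
        # Assumption: cells without ground truth are excluded from the score.
        keep = target != ignore_label
        pred, target = pred[keep], target[keep]
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both grids: skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x2 grids with four classes (made-up data, not from the paper).
pred = [[0, 1], [2, 3]]
target = [[0, 1], [2, 2]]
score = mean_iou(pred, target, num_classes=4)
```

Here classes 0 and 1 match exactly, class 2 overlaps in one of two cells, and class 3 is predicted but never present in the reference, so the per-class IoUs are 1, 1, 0.5, and 0.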
Abstract
In this work, we research and evaluate end-to-end learning of monocular semantic-metric occupancy grid mapping from weak binocular ground truth. The network learns to predict four classes, as well as a camera to bird's eye view mapping. At the core, it utilizes a variational encoder-decoder network that encodes the front-view visual information of the driving scene and subsequently decodes it into a 2-D top-view Cartesian coordinate system. The evaluations on Cityscapes show that the end-to-end learning of semantic-metric occupancy grids outperforms the deterministic mapping approach with flat-plane assumption by more than 12% mean IoU. Furthermore, we show that the variational sampling with a relatively small embedding vector brings robustness against vehicle dynamic perturbations, and generalizability for unseen KITTI data. Our network achieves real-time inference rates of approx. 35…
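The variational sampling mentioned in the abstract can be illustrated with the standard reparameterization step used in variational autoencoders. This is a sketch under stated assumptions, not the authors' implementation: the function names, the latent size of 8, and the KL term against a standard normal prior are the textbook formulation, and this summary does not specify the paper's actual loss or embedding size.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over the latent dimension."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
latent_dim = 8  # hypothetical size; the source only says "relatively small"
mu = np.zeros((2, latent_dim))       # stand-in for the encoder's mean output
log_var = np.zeros((2, latent_dim))  # stand-in for the encoder's log-variance output

z = sample_latent(mu, log_var, rng)  # would be fed to the top-view decoder
kl = kl_to_standard_normal(mu, log_var)
```

Sampling during training exposes the decoder to perturbed embeddings, which is one plausible reading of why the abstract associates a small sampled embedding with robustness.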
