SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks
John McCormac, Ankur Handa, Andrew Davison, Stefan Leutenegger

TL;DR
SemanticFusion combines CNN-based semantic predictions with dense SLAM to produce real-time, detailed 3D semantic maps, improving accuracy over single-frame methods for indoor RGB-D video.
Contribution
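The paper introduces a system that probabilistically fuses per-frame CNN semantic predictions into a dense SLAM map built by ElasticFusion, enabling real-time 3D semantic mapping with higher labeling accuracy than single-frame prediction.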
Findings
Fusing predictions from multiple viewpoints improves semantic labeling accuracy.
The system runs at approximately 25 Hz, fast enough for real-time use.
On NYUv2, the fused predictions also improve 2D semantic segmentation over single-frame predictions.
Abstract
Ever more robust, accurate and detailed mapping using visual sensing has proven to be an enabling factor for mobile robots across a wide variety of applications. For the next level of robot intelligence and intuitive user interaction, maps need to extend beyond geometry and appearance - they need to contain semantics. We address this challenge by combining Convolutional Neural Networks (CNNs) and a state-of-the-art dense Simultaneous Localisation and Mapping (SLAM) system, ElasticFusion, which provides long-term dense correspondence between frames of indoor RGB-D video even during loopy scanning trajectories. These correspondences allow the CNN's semantic predictions from multiple viewpoints to be probabilistically fused into a map. This not only produces a useful semantic 3D map, but we also show on the NYUv2 dataset that fusing multiple predictions leads to an improvement even in the 2D…
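The probabilistic fusion described in the abstract can be illustrated with a recursive Bayesian update: each map element keeps a distribution over classes, and every new per-view prediction is multiplied in and renormalised. The sketch below is a minimal illustration under that assumption, not the authors' implementation; the function names `fuse_prediction` and `fuse_views`, the uniform prior, and the example probabilities are all hypothetical.

```python
def fuse_prediction(prior, likelihood):
    """Apply one recursive Bayesian update to a class distribution.

    prior:      current class probabilities stored for one map element
    likelihood: class probabilities predicted for that element in a new view
    """
    if len(prior) != len(likelihood):
        raise ValueError("prior and likelihood must cover the same classes")
    unnormalised = [p * q for p, q in zip(prior, likelihood)]
    total = sum(unnormalised)
    if total == 0.0:
        # The distributions share no support; keep the prior unchanged.
        return list(prior)
    return [u / total for u in unnormalised]


def fuse_views(num_classes, per_view_predictions):
    """Fuse predictions from several views, starting from a uniform prior."""
    belief = [1.0 / num_classes] * num_classes
    for prediction in per_view_predictions:
        belief = fuse_prediction(belief, prediction)
    return belief


# Hypothetical per-view predictions for one surface point over 3 classes.
views = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.6, 0.2, 0.2],
]
fused = fuse_views(3, views)
best_class = max(range(3), key=lambda c: fused[c])
```

In this toy example no single view assigns more than 0.6 to the first class, yet the fused belief for it is higher than in any individual view, which is the intuition behind multi-view fusion improving on single-frame labels. A real system would also need the SLAM correspondences to decide which pixel predictions belong to which map element; that step is omitted here.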
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques
