Real-Time Multi-Modal Semantic Fusion on Unmanned Aerial Vehicles
Simon Bultmann, Jan Quenzel, Sven Behnke

TL;DR
This paper presents a UAV system that fuses semantic information from LiDAR, RGB, and thermal sensors in real time onboard the vehicle, supporting fast autonomous or remote-controlled scene analysis, e.g., for disaster examination.
Contribution
It introduces a lightweight, real-time multi-modal semantic fusion system on UAVs using embedded inference accelerators and a late fusion approach for enhanced scene understanding.
Findings
Runs semantic inference and fusion onboard the UAV at approximately 9 Hz.
Evaluates the integrated system in real-world experiments in an urban environment.
Provides augmented semantic images and point clouds for improved scene analysis.
Abstract
Unmanned aerial vehicles (UAVs) equipped with multiple complementary sensors have tremendous potential for fast autonomous or remote-controlled semantic scene analysis, e.g., for disaster examination. In this work, we propose a UAV system for real-time semantic inference and fusion of multiple sensor modalities. Semantic segmentation of LiDAR scans and RGB images, as well as object detection on RGB and thermal images, run online onboard the UAV computer using lightweight CNN architectures and embedded inference accelerators. We follow a late fusion approach where semantic information from multiple modalities augments 3D point clouds and image segmentation masks while also generating an allocentric semantic map. Our system provides augmented semantic images and point clouds at 9 Hz. We evaluate the integrated system in real-world experiments in an urban environment.
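The late fusion idea in the abstract, where image semantics augment a 3D point cloud, can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, so everything here is an assumption for illustration: a pinhole camera with intrinsics `K`, points already transformed into the camera frame, and a normalized product of per-class probabilities as the fusion step. The function names `project_points` and `fuse_point_labels` are hypothetical and do not come from the paper.

```python
import numpy as np

def project_points(points_cam, K):
    """Project (N, 3) camera-frame points to integer pixel coordinates.

    Assumes a pinhole model without distortion (an illustrative simplification).
    Returns (uv, in_front), where uv is (N, 2) and in_front marks points with z > 0.
    """
    in_front = points_cam[:, 2] > 1e-6
    uv = np.full((points_cam.shape[0], 2), -1, dtype=np.int64)
    proj = (K @ points_cam[in_front].T).T          # homogeneous pixel coordinates
    uv[in_front] = np.floor(proj[:, :2] / proj[:, 2:3]).astype(np.int64)
    return uv, in_front

def fuse_point_labels(points_cam, point_probs, image_probs, K):
    """Fuse per-point class probabilities with image segmentation probabilities.

    point_probs: (N, C) class probabilities from a point cloud network.
    image_probs: (H, W, C) class probabilities from an image network.
    Points that do not project into the image keep their original probabilities.
    """
    h, w, _ = image_probs.shape
    uv, in_front = project_points(points_cam, K)
    visible = (in_front
               & (uv[:, 0] >= 0) & (uv[:, 0] < w)
               & (uv[:, 1] >= 0) & (uv[:, 1] < h))

    fused = point_probs.astype(np.float64).copy()
    pixel_probs = image_probs[uv[visible, 1], uv[visible, 0]]   # rows = v, cols = u
    product = fused[visible] * pixel_probs
    norm = product.sum(axis=1, keepdims=True)

    # Only overwrite points whose product is non-degenerate (avoids division by zero).
    ok = norm[:, 0] > 0
    idx = np.flatnonzero(visible)[ok]
    fused[idx] = product[ok] / norm[ok]
    return fused, visible
```

Multiplying and renormalizing treats the two predictions as independent evidence, which is a common simplification rather than a claim about the authors' method. Points outside the camera field of view fall back to the point cloud prediction alone, so the output always covers the full scan.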
