TL;DR
HRFuser is a modular, multi-resolution sensor fusion architecture for 2D object detection in autonomous driving. It fuses camera input with additional sensors such as lidar and radar, and outperforms state-of-the-art fusion methods on nuScenes and DENSE.
Contribution
It introduces a multi-resolution fusion architecture, built on high-resolution networks for image-only dense prediction, that uses a novel multi-window cross-attention block to fuse modalities and scales to an arbitrary number of input sensors.
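To make the idea of window-based cross-attention fusion more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`partition`, `merge`, `window_cross_attention`, `fuse`) are hypothetical, learned query/key/value projections and multiple heads are omitted, all sensors are assumed to be already projected onto a feature map of the same shape as the camera features, and the per-sensor outputs are simply summed onto the camera branch.

```python
import numpy as np

def partition(x, w):
    """Split an (H, W, C) map into non-overlapping w x w windows: (num_windows, w*w, C)."""
    H, W, C = x.shape
    assert H % w == 0 and W % w == 0, "H and W must be divisible by the window size"
    x = x.reshape(H // w, w, W // w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

def merge(windows, H, W, w):
    """Inverse of partition: (num_windows, w*w, C) back to (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_cross_attention(query_feat, kv_feat, w):
    """Camera tokens (queries) attend to another sensor's tokens (keys/values)
    inside each local window. Assumption: identity projections, single head."""
    H, W, C = query_feat.shape
    q = partition(query_feat, w)
    k = v = partition(kv_feat, w)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(C))  # (num_windows, w*w, w*w)
    return merge(attn @ v, H, W, w)

def fuse(camera, extras, w):
    """Add one cross-attention term per additional sensor to the camera features.
    Assumption: plain summation; works for any number of extra modalities."""
    fused = camera.copy()
    for feat in extras:
        fused = fused + window_cross_attention(camera, feat, w)
    return fused

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    camera = rng.normal(size=(8, 8, 4))
    lidar = rng.normal(size=(8, 8, 4))
    radar = rng.normal(size=(8, 8, 4))
    print(fuse(camera, [lidar, radar], w=4).shape)
```

Because `fuse` loops over a list of feature maps, adding a further sensor only means appending another array. With an empty list it returns the camera features unchanged, which mirrors the modular design described above. In a multi-resolution setting, a block like this would be applied separately at each resolution of the backbone.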
Findings
Significantly improves 2D object detection over camera-only models.
Outperforms state-of-the-art fusion methods on nuScenes and DENSE datasets.
Effectively leverages multiple sensor modalities for robust perception.
Abstract
Besides standard cameras, autonomous vehicles typically include multiple additional sensors, such as lidars and radars, which help acquire richer information for perceiving the content of the driving scene. While several recent works focus on fusing certain pairs of sensors - such as camera with lidar or radar - by using architectural components specific to the examined setting, a generic and modular sensor fusion architecture is missing from the literature. In this work, we propose HRFuser, a modular architecture for multi-modal 2D object detection. It fuses multiple sensors in a multi-resolution fashion and scales to an arbitrary number of input modalities. The design of HRFuser is based on state-of-the-art high-resolution networks for image-only dense prediction and incorporates a novel multi-window cross-attention block as the means to perform fusion of multiple modalities at…
