M$^2$BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation
Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, Jose M. Alvarez

TL;DR
M$^2$BEV introduces a unified multi-camera framework that jointly performs 3D detection and segmentation in BEV space, improving efficiency and accuracy in camera-based perception tasks.
Contribution
The paper presents a novel unified model for joint 3D detection and segmentation in BEV space, with four key design innovations enhancing performance and efficiency.
Findings
Achieves state-of-the-art results on the nuScenes dataset.
Outperforms previous methods on both detection and segmentation metrics.
Supports higher-resolution inputs with faster inference.
Abstract
In this paper, we propose M$^2$BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Birds-Eye View (BEV) space with multi-camera image inputs. Unlike the majority of previous works which separately process detection and segmentation, M$^2$BEV infers both tasks with a unified model and improves efficiency. M$^2$BEV efficiently transforms multi-view 2D image features into the 3D BEV feature in ego-car coordinates. Such BEV representation is important as it enables different tasks to share a single encoder. Our framework further contains four important designs that benefit both accuracy and efficiency: (1) An efficient BEV encoder design that reduces the spatial dimension of a voxel feature map. (2) A dynamic box assignment strategy that uses learning-to-match to assign ground-truth 3D boxes with anchors. (3) A BEV centerness re-weighting that…
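Design (1) in the abstract, an encoder that reduces the spatial dimension of a voxel feature map, can be illustrated with a small sketch. The abstract does not spell out the operation, so the version below rests on an assumption: the height axis of an (X, Y, Z, C) voxel grid is folded into the channel axis, giving an (X, Y, Z*C) map that ordinary 2D convolutions can process. The function name, the axis order, and the toy shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def fold_height_into_channels(voxels: np.ndarray) -> np.ndarray:
    """Collapse the height axis of an (X, Y, Z, C) voxel grid into channels.

    Assumed interpretation of "reducing the spatial dimension": the result
    has shape (X, Y, Z * C), i.e. one fewer spatial axis than the input.
    """
    if voxels.ndim != 4:
        raise ValueError("expected a 4D array shaped (X, Y, Z, C)")
    x, y, z, c = voxels.shape
    # C-order reshape keeps each height slice's channels contiguous:
    # output channel index = z_index * C + c_index.
    return voxels.reshape(x, y, z * c)


# Toy voxel grid: 4 x 4 ground-plane cells, 2 height bins, 3 channels.
voxels = np.arange(4 * 4 * 2 * 3, dtype=np.float32).reshape(4, 4, 2, 3)
bev = fold_height_into_channels(voxels)
print(bev.shape)
```

A reshape like this adds no parameters and moves no data, which is one plausible reason such a step would help efficiency compared with running 3D convolutions over the full voxel grid.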
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
