SimPB++: Simultaneously Detecting 2D and 3D Objects from Multiple Cameras
Yingqi Tang, Zhaotie Meng, Erkang Cheng, Haibin Ling

TL;DR
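SimPB++ is an end-to-end multi-camera framework that jointly detects 2D objects in perspective view and 3D objects in BEV. Two interaction modules, Dynamic Query Allocation and Adaptive Query Aggregation, couple the 2D and 3D decoders, and mixed supervision reduces reliance on 3D labels. The authors report state-of-the-art results on nuScenes.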
Contribution
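It introduces a unified end-to-end model whose hybrid decoder couples multi-view 2D and 3D decoders through two interaction modules, Dynamic Query Allocation and Adaptive Query Aggregation, to detect 2D and 3D objects simultaneously from multiple cameras.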
Findings
Achieves state-of-the-art results on nuScenes for 2D and 3D detection.
Supports long-range detection up to 150 meters on Argoverse2.
Reduces reliance on expensive 3D labels through mixed supervision.
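This summary does not say how the mixed supervision is implemented. As a rough illustration only, one common way to train with partially labeled data is to apply the 2D loss to every sample and the 3D loss only to samples that carry 3D labels. The function name `mixed_loss`, the weight `w_3d`, and the per-sample loss values below are hypothetical and are not taken from the paper.

```python
import numpy as np

def mixed_loss(loss_2d, loss_3d, has_3d_label, w_3d=1.0):
    """Combine per-sample losses when only some samples have 3D labels.

    Assumption (not from the paper): the 2D loss is averaged over all
    samples, the 3D loss only over samples flagged in has_3d_label.
    """
    loss_2d = np.asarray(loss_2d, dtype=float)
    loss_3d = np.asarray(loss_3d, dtype=float)
    mask = np.asarray(has_3d_label, dtype=bool)

    total_2d = loss_2d.mean()
    # Skip the 3D term entirely if no sample in the batch has 3D labels.
    total_3d = loss_3d[mask].mean() if mask.any() else 0.0
    return total_2d + w_3d * total_3d

# Hypothetical batch of three samples; the second has 2D labels only,
# so its 3D loss entry is ignored.
total = mixed_loss(
    loss_2d=[1.0, 2.0, 3.0],
    loss_3d=[4.0, 0.0, 6.0],
    has_3d_label=[True, False, True],
)
```

Under this scheme, samples with only 2D boxes still contribute gradient through the 2D branch, which is one way a model could lean less on 3D annotation.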
Abstract
Simultaneous perception of 2D objects in perspective view and 3D objects in Bird's Eye View (BEV) is challenging for multi-camera autonomous driving. Existing two-stage pipelines use 2D results only as a one-time cue for 3D detection. We propose SimPB++, which simultaneously detects 2D objects in perspective and 3D objects in BEV from multiple cameras. It unifies both tasks into an end-to-end model with a hybrid decoder architecture, coupling multi-view 2D and 3D decoders interactively. Two novel modules enable deep interaction: Dynamic Query Allocation adaptively assigns 2D queries to 3D candidates, and Adaptive Query Aggregation refines 3D representations using multi-view 2D features, forming a cyclic 3D-2D-3D refinement. For multi-view 2D detection, we use Query-group Attention for intra-group communication. We also design a Crop-and-Scale strategy for long-range perception and a…
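The abstract describes Query-group Attention only as enabling intra-group communication. A minimal sketch of that idea, assuming each 2D query belongs to one group (here, hypothetically, the camera it was allocated to) and that self-attention is masked so queries only attend within their own group. The helper names and the single-head, projection-free attention are simplifications for illustration, not the authors' implementation.

```python
import numpy as np

def group_attention_mask(group_ids):
    """Boolean mask: entry (i, j) is True when queries i and j share a group."""
    g = np.asarray(group_ids)
    return g[:, None] == g[None, :]

def grouped_self_attention(x, group_ids):
    """Single-head scaled dot-product self-attention restricted to groups.

    Simplification: no learned query/key/value projections are applied.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    mask = group_attention_mask(group_ids)
    # Block attention across groups; the diagonal is always allowed,
    # so every row keeps at least one finite score.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 8))   # five 2D queries with 8-dim features
camera_of_query = [0, 0, 1, 1, 1]   # hypothetical allocation: two cameras
out, attn = grouped_self_attention(queries, camera_of_query)
```

The resulting attention matrix is block-diagonal: queries assigned to camera 0 receive zero weight from queries assigned to camera 1, and vice versa.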
