MOSE: Boosting Vision-based Roadside 3D Object Detection with Scene Cues
Xiahan Chen, Mingjian Chen, Sanli Tang, Yi Niu, Jiang Zhu

TL;DR
This paper introduces MOSE, a monocular 3D object detection framework that uses scene cues and a transformer decoder to improve roadside perception for autonomous driving, achieving state-of-the-art results.
Contribution
The paper proposes a scene cue bank and a transformer-based decoder to enhance 3D object detection from roadside cameras, exploiting the stationary camera setup and inter-frame consistency that earlier methods neglect.
Findings
Surpasses existing methods on public benchmarks
Achieves significant performance improvements
Demonstrates robustness across diverse scenes
Abstract
3D object detection based on roadside cameras gives autonomous driving an additional way to alleviate the occlusion and short perception range of vehicle-mounted cameras. Previous methods for roadside 3D object detection mainly focus on modeling the depth or height of objects, neglecting the fact that the cameras are stationary and that the scene is consistent across frames. In this work, we propose a novel framework, namely MOSE, for MOnocular 3D object detection with Scene cuEs. The scene cues are frame-invariant, scene-specific features that are crucial for object localization and can be intuitively regarded as the height between the surface of the real road and the virtual ground plane. In the proposed framework, a scene cue bank is designed to aggregate scene cues from multiple frames of the same scene with a carefully designed extrinsic augmentation strategy. Then, a…
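To illustrate why the height between the real road surface and a virtual ground plane matters for localization, here is a minimal geometric sketch. It is not MOSE's implementation: the function name `intersect_ray_with_ground`, the 6 m camera height, the ray direction, and the 0.5 m road offset are all assumptions chosen for illustration. It intersects a single viewing ray with a horizontal plane and shows how far the estimated position moves when the assumed plane height is wrong.

```python
def intersect_ray_with_ground(cam_center, ray_dir, road_height=0.0):
    """Intersect a viewing ray with the horizontal plane z = road_height.

    cam_center and ray_dir are (x, y, z) tuples in a world frame whose
    z axis points up. All names here are hypothetical, for illustration.
    """
    cx, cy, cz = cam_center
    dx, dy, dz = ray_dir
    if abs(dz) < 1e-9:
        raise ValueError("ray is parallel to the ground plane")
    # Solve cz + t * dz = road_height for the ray parameter t.
    t = (road_height - cz) / dz
    if t <= 0:
        raise ValueError("plane is behind the camera along this ray")
    return (cx + t * dx, cy + t * dy, road_height)


# Assumed setup: camera mounted 6 m above the virtual ground plane (z = 0),
# looking forward along +y and slightly downward.
cam = (0.0, 0.0, 6.0)
ray = (0.0, 1.0, -0.125)

on_virtual_plane = intersect_ray_with_ground(cam, ray, road_height=0.0)
on_real_road = intersect_ray_with_ground(cam, ray, road_height=0.5)

# Longitudinal error caused by ignoring the 0.5 m road offset.
range_error = on_virtual_plane[1] - on_real_road[1]
print(on_virtual_plane, on_real_road, range_error)
```

In this assumed setup, a half-metre difference between the real road and the virtual plane shifts the recovered position by several metres along the road. Because the camera is fixed, that offset is the same in every frame of a scene, which is consistent with the abstract's description of scene cues as frame-invariant.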
Taxonomy
Topics: Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety · Video Surveillance and Tracking Methods
