Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking
Longlong Jing, Ruichi Yu, Henrik Kretzschmar, Kang Li, Charles R. Qi, Hang Zhao, Alper Ayvaci, Xu Chen, Dillon Cower, Yingwei Li, Yurong You, Han Deng, Congcong Li, Dragomir Anguelov

TL;DR
This paper emphasizes the importance of accurate per-object depth estimation in monocular 3D perception and introduces a multi-level fusion approach that significantly improves depth accuracy and downstream detection and tracking performance.
Contribution
The paper proposes a multi-level fusion method that combines two representations (RGB and pseudo-LiDAR) with temporal information across multiple frames of each object (tracklet) to enhance per-object depth estimation in monocular 3D perception.
Findings
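Achieves state-of-the-art per-object depth estimation on the Waymo Open Dataset, the KITTI detection dataset, and the KITTI MOT dataset.
Improves monocular 3D detection and tracking when the fusion-enhanced depth replaces the originally estimated depth.
Identifies per-object depth estimation accuracy as a major factor bounding monocular 3D perception performance.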
Abstract
Monocular image-based 3D perception has become an active research area in recent years owing to its applications in autonomous driving. Approaches to monocular 3D perception including detection and tracking, however, often yield inferior performance when compared to LiDAR-based techniques. Through systematic analysis, we identified that per-object depth estimation accuracy is a major factor bounding the performance. Motivated by this observation, we propose a multi-level fusion method that combines different representations (RGB and pseudo-LiDAR) and temporal information across multiple frames for objects (tracklets) to enhance per-object depth estimation. Our proposed fusion method achieves the state-of-the-art performance of per-object depth estimation on the Waymo Open Dataset, the KITTI detection dataset, and the KITTI MOT dataset. We further demonstrate that by simply replacing…
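The pseudo-LiDAR representation mentioned in the abstract is commonly obtained by back-projecting a per-pixel depth map into 3D points with the pinhole camera model. The sketch below illustrates that general idea only; it is not the authors' implementation. The function name, the box format, and the camera-frame convention (x right, y down, z forward) are assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]
Box2D = Tuple[int, int, int, int]  # hypothetical format: (u_min, v_min, u_max, v_max), max exclusive


def depth_to_pseudo_lidar(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    box: Optional[Box2D] = None,
    min_depth: float = 1e-3,
) -> List[Point3D]:
    """Back-project a depth map (row-major, metres) into camera-frame 3D points.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal point
    (cx, cy). If `box` is given, only pixels inside that 2D box are lifted,
    which yields a per-object point set.
    """
    height = len(depth)
    width = len(depth[0]) if height else 0
    u_min, v_min, u_max, v_max = box if box is not None else (0, 0, width, height)

    points: List[Point3D] = []
    for v in range(max(v_min, 0), min(v_max, height)):
        for u in range(max(u_min, 0), min(u_max, width)):
            z = depth[v][u]
            if z <= min_depth:
                continue  # skip pixels with missing or invalid depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points
```

Calling the function with a 2D detection box returns only the points belonging to that image region, which is one simple way to obtain a per-object point set from a dense depth map.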
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Advanced Vision and Imaging
