Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking

Longlong Jing; Ruichi Yu; Henrik Kretzschmar; Kang Li; Charles R. Qi; Hang Zhao; Alper Ayvaci; Xu Chen; Dillon Cower; Yingwei Li; Yurong You; Han Deng; Congcong Li; Dragomir Anguelov

arXiv:2206.03666·cs.CV·June 9, 2022

TL;DR

This paper emphasizes the importance of accurate per-object depth estimation in monocular 3D perception and introduces a multi-level fusion approach that significantly improves depth accuracy and downstream detection and tracking performance.

Contribution

The paper proposes a novel multi-level fusion method combining RGB, pseudo-LiDAR, and temporal data to enhance per-object depth estimation in monocular 3D perception.
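As an illustration of the pseudo-LiDAR representation mentioned above, the following minimal sketch back-projects a dense depth map into a 3D point cloud using a pinhole camera model. It is not the paper's implementation: the function name `depth_to_pseudo_lidar` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are assumptions chosen for illustration, and the fusion with RGB features and temporal (tracklet) information is not reproduced here.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud.

    Hypothetical helper: assumes a pinhole camera with focal lengths
    (fx, fy) and principal point (cx, cy), all in pixels. Points are
    returned in the camera frame (x right, y down, z forward).
    """
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    # Drop pixels without a valid (positive) depth value.
    return points[points[:, 2] > 0]

# Toy usage with made-up intrinsics and a constant 10 m depth map.
toy_depth = np.full((2, 2), 10.0)
toy_points = depth_to_pseudo_lidar(toy_depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

In a per-object setting, one would presumably crop the points that fall inside an object's 2D box before estimating its depth; that cropping step is likewise an assumption and is not shown.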

Findings

01

Achieves state-of-the-art per-object depth estimation on the Waymo Open Dataset, the KITTI detection dataset, and the KITTI MOT dataset.

02

Improves monocular 3D detection and tracking when existing depth estimates are replaced with the fusion-enhanced depth.

03

Demonstrates the critical impact of depth accuracy on perception performance.

Abstract

Monocular image-based 3D perception has become an active research area in recent years owing to its applications in autonomous driving. Approaches to monocular 3D perception including detection and tracking, however, often yield inferior performance when compared to LiDAR-based techniques. Through systematic analysis, we identified that per-object depth estimation accuracy is a major factor bounding the performance. Motivated by this observation, we propose a multi-level fusion method that combines different representations (RGB and pseudo-LiDAR) and temporal information across multiple frames for objects (tracklets) to enhance per-object depth estimation. Our proposed fusion method achieves the state-of-the-art performance of per-object depth estimation on the Waymo Open Dataset, the KITTI detection dataset, and the KITTI MOT dataset. We further demonstrate that by simply replacing…

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Advanced Vision and Imaging