Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
Zhihao Zhang, Abhinav Kumar, Girish Chandar Ganesan, Xiaoming Liu

TL;DR
MonoCoP introduces a novel adaptive framework for monocular 3D detection that leverages inter-attribute correlations and uncertainty guidance to improve depth estimation accuracy.
Contribution
The paper proposes MonoCoP, which combines a Chain-of-Prediction with an Uncertainty-Guided Selector to exploit attribute correlations dynamically, and reports state-of-the-art results.
Findings
MonoCoP outperforms previous methods on KITTI, nuScenes, and Waymo datasets.
It yields a significant improvement in depth accuracy for distant objects.
It switches effectively between exploiting attribute correlations and predicting attributes in parallel.
Abstract
Monocular 3D detection (Mono3D) aims to infer 3D bounding boxes from a single RGB image. Without auxiliary sensors such as LiDAR, this task is inherently ill-posed since the 3D-to-2D projection introduces depth ambiguity. Previous works often predict 3D attributes (e.g., depth, size, and orientation) in parallel, overlooking that these attributes are inherently correlated through the 3D-to-2D projection. However, simply enforcing such correlations through sequential prediction can propagate errors across attributes, especially when objects are occluded or truncated, where inaccurate size or orientation predictions can further amplify depth errors. Therefore, neither parallel nor sequential prediction is optimal. In this paper, we propose MonoCoP, an adaptive framework that learns when and how to leverage inter-attribute correlations with two complementary designs. A Chain-of-Prediction…
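The abstract's point that size and depth are tied together by the 3D-to-2D projection can be illustrated with the standard pinhole camera relation z = f * H / h, where f is the focal length in pixels, H the object's 3D height, and h its projected height in the image. The sketch below is not MonoCoP's method. It assumes an upright object and ignores orientation, and the focal length, object height, and depths are illustrative values rather than numbers from the paper. It shows how a fixed relative error in predicted size turns into a depth error that grows with distance when depth is derived sequentially from size.

```python
def depth_from_height(focal_px: float, height_3d_m: float, height_2d_px: float) -> float:
    """Pinhole relation z = f * H / h (assumes an upright object; ignores orientation)."""
    if height_2d_px <= 0:
        raise ValueError("projected height must be positive")
    return focal_px * height_3d_m / height_2d_px


def depth_error_from_size_error(
    focal_px: float,
    true_height_m: float,
    true_depth_m: float,
    rel_size_error: float,
) -> float:
    """Depth error (metres) caused by a relative error in the predicted 3D height.

    All arguments are hypothetical inputs chosen for illustration.
    """
    # Projected height the camera would observe for the true object.
    height_2d_px = focal_px * true_height_m / true_depth_m
    # A size prediction that is off by the given relative error.
    pred_height_m = true_height_m * (1.0 + rel_size_error)
    # Depth recovered from the wrong size, minus the true depth.
    return depth_from_height(focal_px, pred_height_m, height_2d_px) - true_depth_m


if __name__ == "__main__":
    # Illustrative values only: 720 px focal length, 1.5 m tall object, 10% size error.
    for depth_m in (15.0, 30.0, 60.0):
        err = depth_error_from_size_error(720.0, 1.5, depth_m, 0.10)
        print(f"true depth {depth_m:5.1f} m -> depth error {err:+.2f} m")
```

Under this simplified model the relative depth error equals the relative size error, so the absolute error scales with distance. This is consistent with the abstract's argument that chaining attributes naively can amplify depth errors when the upstream size estimate is unreliable.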
