Time-to-Label: Temporal Consistency for Self-Supervised Monocular 3D Object Detection
Issa Mouawad, Nikolas Brasch, Fabian Manhardt, Federico Tombari, Francesca Odone

TL;DR
This paper introduces a self-supervised approach leveraging temporal consistency of object poses to improve monocular 3D object detection, reducing annotation needs and achieving competitive results on KITTI3D.
Contribution
It proposes a novel self-supervised loss based on temporal pose consistency to refine pseudo labels for monocular 3D detection.
Findings
Achieves competitive performance on the KITTI3D benchmark.
Enhances pose prediction accuracy through temporal consistency.
Reduces reliance on extensive manual annotations.
Abstract
Monocular 3D object detection continues to attract attention due to the cost benefits and wider availability of RGB cameras. Despite the recent advances and the ability to acquire data at scale, annotation cost and complexity still limit the size of 3D object detection datasets in the supervised setting. Self-supervised methods, on the other hand, aim at training deep networks relying on pretext tasks or various consistency constraints. Moreover, other 3D perception tasks (such as depth estimation) have shown the benefits of temporal priors as a self-supervision signal. In this work, we argue that temporal consistency at the level of object poses provides an important supervision signal given the strong prior on physical motion. Specifically, we propose a self-supervised loss which uses this consistency, in addition to render-and-compare losses, to refine noisy pose predictions…
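The abstract excerpt does not give the form of the temporal consistency loss, so the following is only an illustrative sketch of the general idea, not the authors' formulation. It assumes a constant-velocity motion prior: for one tracked object, it penalizes the acceleration (second difference) of the predicted 3D centers and the frame-to-frame change in yaw. The function names, the `yaw_weight` parameter, and the choice of prior are all assumptions made for illustration; the render-and-compare term mentioned in the abstract is not modeled.

```python
import math


def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def temporal_consistency_penalty(centers, yaws, yaw_weight=1.0):
    """Hypothetical smoothness penalty for one object track.

    centers: list of (x, y, z) predicted object centers, one per frame.
    yaws:    list of predicted yaw angles in radians, one per frame.

    Assumes a constant-velocity prior (an assumption of this sketch,
    not taken from the paper): the second difference of the centers
    and the wrapped yaw change between frames should both be small.
    """
    if len(centers) != len(yaws):
        raise ValueError("centers and yaws must have the same length")
    n = len(centers)
    if n < 3:
        return 0.0  # too few frames to estimate acceleration

    # Mean squared second difference of the 3D centers.
    accel = 0.0
    for i in range(1, n - 1):
        for k in range(3):
            d = centers[i - 1][k] - 2.0 * centers[i][k] + centers[i + 1][k]
            accel += d * d
    accel /= (n - 2)

    # Mean squared yaw change, wrapped so that 3.1 -> -3.1 counts as small.
    yaw_term = 0.0
    for i in range(1, n):
        d = wrap_angle(yaws[i] - yaws[i - 1])
        yaw_term += d * d
    yaw_term /= (n - 1)

    return accel + yaw_weight * yaw_term


# A track moving at constant velocity versus one with a depth outlier.
smooth = temporal_consistency_penalty(
    [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (2.0, 0.0, 10.0)], [0.0, 0.0, 0.0]
)
noisy = temporal_consistency_penalty(
    [(0.0, 0.0, 10.0), (1.0, 0.0, 12.0), (2.0, 0.0, 10.0)], [0.0, 0.0, 0.0]
)
```

In this toy example the constant-velocity track incurs no penalty, while the track with a 2 m depth jump in the middle frame is penalized. In a training setup such a term would presumably be computed on differentiable tensors and used to pull noisy per-frame pose predictions toward a physically plausible trajectory.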
