TL;DR
This paper introduces an end-to-end framework for monocular depth estimation in dynamic scenes. It jointly models the 6-DoF motion of individual objects, ego-motion, and depth without supervision, using an instance-aware consistency loss and automatically generated instance annotations.
Contribution
It proposes a geometrically correct projection pipeline built on a neural forward projection module, a unified instance-aware photometric and geometric consistency loss, and an auto-annotation scheme that allows training without ground-truth labels.
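As a rough illustration of what an instance-aware photometric term could look like, the sketch below sums one masked error for the background and one for each object instance. It is a simplification under stated assumptions, not the paper's loss: it uses a plain L1 error only, omits the geometric consistency term, and the names `masked_l1` and `instance_aware_photometric_loss` are hypothetical.

```python
import numpy as np

def masked_l1(target, warped, mask):
    """Mean absolute photometric error over the pixels where mask is True."""
    if mask.sum() == 0:
        return 0.0  # an empty region contributes nothing
    return float(np.abs(target - warped)[mask].mean())

def instance_aware_photometric_loss(target, warped_bg, warped_objs, obj_masks):
    """Sum of masked L1 terms: one for the background, one per object.

    target      : (H, W) or (H, W, C) target image
    warped_bg   : source image warped with ego-motion only (assumed given)
    warped_objs : sequence of N source images, each warped with one
                  object's motion (assumed given)
    obj_masks   : (N, H, W) boolean array, one mask per object instance
    """
    # Background is every pixel not covered by any object mask.
    bg_mask = ~np.any(obj_masks, axis=0)
    loss = masked_l1(target, warped_bg, bg_mask)
    for warped, mask in zip(warped_objs, obj_masks):
        loss += masked_l1(target, warped, mask)
    return loss
```

Splitting the error by region means a moving object is compared against an image warped with its own motion, rather than being penalized for violating the static-scene assumption.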
Findings
Outperforms state-of-the-art methods on KITTI and Cityscapes datasets.
Effectively models multiple dynamic objects and ego-motion in monocular depth estimation.
Validated through extensive ablation studies.
Abstract
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection while modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward projection module. Second, we design a unified instance-aware photometric and geometric consistency loss that holistically imposes self-supervisory signals for every background and object region. Lastly, we introduce a general-purpose auto-annotation scheme using any off-the-shelf instance segmentation and optical flow models to produce video instance segmentation maps that will be utilized as input to our training pipeline. These proposed…
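The forward projection mentioned in the abstract can be illustrated geometrically: each source pixel is lifted to 3-D with its own depth, moved by a rigid transform, and projected into the target view. Inverse warping, by contrast, starts from each target pixel and looks up where to sample in the source. The NumPy sketch below shows only the forward geometry for a pinhole camera; it is not the paper's neural forward projection module, and the function names, the intrinsics `K`, and the 4x4 transform `T_src_to_tgt` are assumptions made for illustration.

```python
import numpy as np

def backproject(depth, K):
    """Lift each pixel (u, v) with depth d to the 3-D point d * K^-1 [u, v, 1]^T."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    rays = np.linalg.inv(K) @ pix
    return rays * depth.reshape(1, -1)  # shape (3, h*w)

def forward_project(depth_src, K, T_src_to_tgt):
    """Move source-frame points by a rigid transform and project them.

    Returns target-view pixel coordinates of shape (2, h*w) and the
    transformed depths of shape (h*w,). Assumes all depths stay positive
    and that the last row of K is [0, 0, 1].
    """
    pts = backproject(depth_src, K)
    R, t = T_src_to_tgt[:3, :3], T_src_to_tgt[:3, 3:4]
    pts_tgt = R @ pts + t           # rigid motion applied in 3-D
    z = pts_tgt[2]
    uv = (K @ pts_tgt)[:2] / z      # perspective division
    return uv, z
```

For a scene with several moving objects, one would presumably call `forward_project` once per region, using the ego-motion transform for the background and a separate transform for the pixels of each object. The projected coordinates are generally non-integer and can collide or leave holes, which is one reason forward projection needs more care than inverse sampling.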
