TL;DR
This paper introduces M^3PT, a multi-modal masked pre-training approach for panoramic depth completion, significantly improving dense depth recovery from sparse data and RGB images.
Contribution
To the authors' knowledge, this is the first work to apply masked pre-training to a multi-modal vision task. It improves panoramic depth completion performance without changing the network architecture.
Findings
Achieves up to 51.7% reduction in MRE
Improves RMSE by 26.2% over baselines
Effective across three panoramic datasets
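The percentages above are relative reductions in error metrics. The sketch below shows how such numbers are commonly computed in depth completion work; it assumes MRE stands for mean relative error, that both metrics are evaluated only on pixels with valid ground truth, and that the reduction is measured against a baseline score. The function names are hypothetical and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def rmse(pred, gt, valid):
    """Root mean squared error over pixels flagged as valid."""
    diff = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

def mre(pred, gt, valid):
    """Mean relative error (assumed expansion of MRE): |pred - gt| / gt."""
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def relative_reduction(baseline, improved):
    """Percentage by which `improved` lowers the `baseline` error."""
    return 100.0 * (baseline - improved) / baseline
```

For example, a baseline error of 4.0 that drops to 3.0 is a 25% reduction under this definition.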
Abstract
In this paper, we formulate a potentially valuable panoramic depth completion (PDC) task, as panoramic 3D cameras often produce 360° depth with missing data in complex scenes. Its goal is to recover dense panoramic depth from raw sparse depth and panoramic RGB images. To deal with the PDC task, we train a deep network that takes both depth and image as inputs for dense panoramic depth recovery. However, optimizing the network parameters is challenging because the objective function is non-convex. To address this problem, we propose a simple yet effective approach termed M^3PT: multi-modal masked pre-training. Specifically, during pre-training, we simultaneously cover up patches of the panoramic RGB image and the sparse depth with a shared random mask, then reconstruct the sparse depth in the masked regions. To the best of our knowledge, it is the first time that we…
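The pre-training step described in the abstract (one random patch mask shared by the RGB image and the sparse depth, with reconstruction of the sparse depth in masked regions) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the patch size, mask ratio, zero fill value, the convention that depth > 0 marks a valid measurement, and the squared-error loss are all assumptions, and every function name is hypothetical.

```python
import numpy as np

def shared_patch_mask(height, width, patch, mask_ratio, rng):
    """Boolean (height, width) mask; True marks pixels in masked patches."""
    assert height % patch == 0 and width % patch == 0
    gh, gw = height // patch, width // patch
    n_patches = gh * gw
    n_masked = int(round(mask_ratio * n_patches))
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    grid = flat.reshape(gh, gw)
    # Expand each per-patch decision to a patch x patch block of pixels.
    return np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)

def mask_inputs(rgb, sparse_depth, mask):
    """Apply the SAME mask to both modalities (assumed fill value: 0)."""
    masked_rgb = np.where(mask[..., None], 0.0, rgb)
    masked_depth = np.where(mask, 0.0, sparse_depth)
    # Supervise only hidden pixels that carried a depth measurement
    # (assumption: depth > 0 means the measurement is valid).
    target_region = mask & (sparse_depth > 0)
    return masked_rgb, masked_depth, target_region

def masked_reconstruction_loss(pred_depth, sparse_depth, target_region):
    """Mean squared error on masked, valid depth pixels (assumed loss)."""
    if not target_region.any():
        return 0.0
    diff = pred_depth[target_region] - sparse_depth[target_region]
    return float(np.mean(diff ** 2))
```

In use, a network would take `masked_rgb` and `masked_depth` as inputs, and its depth prediction would be scored with `masked_reconstruction_loss` before fine-tuning on the completion task. Sharing one mask means a hidden depth patch cannot be recovered by simply looking at the co-located RGB patch.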
Taxonomy
Methods: Masked autoencoder
