Occlusion Boundary and Depth: Mutual Enhancement via Multi-Task Learning
Lintao Xu, Yinghao Wang, Chaohui Wang

TL;DR
This paper introduces MoDOT, a multi-task learning framework that jointly estimates occlusion boundaries and depth from a single image, leveraging their mutual relationship to improve accuracy and generalization.
Contribution
The paper proposes a novel multi-task framework for joint occlusion boundary and depth estimation, built on a Cross-Attention Strip Module and a geometric consistency loss, and contributes a new dataset, OB-Hypersim.
Findings
MoDOT outperforms single-task and multi-task baselines on synthetic datasets and NYUD-v2.
Models trained on synthetic data generalize well to real-world scenes without fine-tuning.
Joint modeling produces sharper boundaries and better geometric fidelity in depth maps.
Abstract
Occlusion Boundary Estimation (OBE) identifies boundaries arising from both inter-object occlusions and self-occlusion within individual objects. This task is closely related to Monocular Depth Estimation (MDE), which infers depth from a single image, as Occlusion Boundaries (OBs) provide critical geometric cues for resolving depth ambiguities, while depth can conversely refine occlusion reasoning. In this paper, we aim to systematically model and exploit this mutually beneficial relationship. To this end, we propose MoDOT, a novel framework for joint estimation of depth and OBs, which incorporates a new Cross-Attention Strip Module (CASM) to leverage mid-level OB features for depth prediction, and a novel OB-Depth Constraint Loss (OBDCL) to enforce geometric consistency. To facilitate this study, we contribute OB-Hypersim, a large-scale photorealistic dataset with precise depth and…
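The abstract names OBDCL as a loss that enforces geometric consistency between OBs and depth, but it does not give its form. As a rough illustration of the general idea only, the sketch below penalizes depth jumps where no boundary is predicted and predicted boundaries that have no depth jump behind them. The function name `ob_depth_consistency`, the two terms, and the toy inputs are all assumptions made for illustration; this is not the paper's formulation.

```python
import numpy as np

def ob_depth_consistency(depth, ob_prob):
    """Hypothetical OB/depth consistency penalty (NOT the paper's OBDCL).

    depth:   (H, W) array of predicted depth values.
    ob_prob: (H, W) array of occlusion-boundary probabilities in [0, 1].
    """
    # Forward differences of depth along x and y, cropped to a common shape.
    dx = np.abs(depth[:-1, 1:] - depth[:-1, :-1])
    dy = np.abs(depth[1:, :-1] - depth[:-1, :-1])
    grad = dx + dy
    ob = ob_prob[:-1, :-1]

    # Term 1: depth discontinuities where no boundary is predicted.
    smooth_term = np.mean((1.0 - ob) * grad)
    # Term 2: predicted boundaries with little or no depth discontinuity.
    edge_term = np.mean(ob * np.exp(-grad))
    return smooth_term + edge_term

# Toy scene: a depth step between columns 1 and 2.
depth = np.ones((4, 4))
depth[:, 2:] = 5.0

# Boundary map that agrees with the depth step.
ob_aligned = np.zeros((4, 4))
ob_aligned[:, 1] = 1.0

# Boundary map placed one column away from the depth step.
ob_misaligned = np.zeros((4, 4))
ob_misaligned[:, 0] = 1.0

aligned = ob_depth_consistency(depth, ob_aligned)
misaligned = ob_depth_consistency(depth, ob_misaligned)
print(aligned < misaligned)
```

In this toy case the boundary map that coincides with the depth step receives the lower penalty, which is the qualitative behavior a consistency loss of this kind is meant to encourage.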
Taxonomy
Topics: Face recognition and analysis · Human Motion and Animation
