Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
Wuqi Su, Huilun Song, Chen Zhao, Chi Xu

TL;DR
This paper introduces a hierarchical model that combines hybrid pyramid feature fusion with a CRF decoder for monocular depth estimation, reporting state-of-the-art accuracy with improved computational efficiency.
Contribution
The paper proposes a multilevel perceptual CRF model with hybrid pyramid feature fusion and a hierarchical awareness adapter, aimed at improving both depth prediction accuracy and computational efficiency.
Findings
Achieves state-of-the-art performance on NYU Depth v2 and KITTI datasets.
Reduces absolute relative error (Abs Rel) to 0.088 on NYU Depth v2.
Attains near-perfect threshold accuracy on KITTI with 194M parameters.
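For readers unfamiliar with the two metrics named in the findings, the following is a minimal sketch of how Abs Rel and threshold accuracy are conventionally computed in depth-estimation benchmarks. It is not taken from the paper's code: the function names `abs_rel` and `threshold_accuracy`, the flat-list input format, and the rule that pixels with non-positive ground truth are ignored are all assumptions made for illustration.

```python
from typing import Sequence


def abs_rel(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]  # assumption: gt <= 0 means "no label"
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def threshold_accuracy(pred: Sequence[float], gt: Sequence[float],
                       k: int = 1, base: float = 1.25) -> float:
    """Fraction of valid pixels with max(pred/gt, gt/pred) < base**k."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    limit = base ** k
    # A non-positive prediction can never satisfy the ratio test, so count it as a miss.
    hits = sum(1 for p, g in pairs if p > 0 and max(p / g, g / p) < limit)
    return hits / len(pairs)


if __name__ == "__main__":
    # Toy depths in metres (hypothetical values, not results from the paper).
    pred = [1.0, 2.0, 4.0, 3.0]
    gt = [1.0, 2.0, 2.0, 4.0]
    print(abs_rel(pred, gt))                   # 0.3125
    print(threshold_accuracy(pred, gt, k=1))   # 0.5
```

Lower Abs Rel is better, while threshold accuracy approaches 1.0 as predictions fall within a fixed ratio of the ground truth, which is the sense in which a result can be "near-perfect".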
Abstract
Monocular depth estimation from a single RGB image remains a fundamental challenge in computer vision due to inherent scale ambiguity and the absence of explicit geometric cues. Existing approaches typically rely on increasingly complex network architectures to regress depth maps, which escalates training costs and computational overhead without fully exploiting inter-pixel spatial dependencies. We propose a multilevel perceptual conditional random field (CRF) model built upon the Swin Transformer backbone that addresses these limitations through three synergistic innovations: (1) an adaptive hybrid pyramid feature fusion (HPF) strategy that captures both short-range and long-range dependencies by combining multi-scale spatial pyramid pooling with biaxial feature aggregation, enabling effective integration of global and local contextual information; (2) a hierarchical awareness adapter…
