Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth
Kyumin Hwang, Wonhyeok Choi, Kiljoon Han, Wonjoon Choi, Minwoo Choi, Yongcheon Na, Minwoo Park, Sunghoon Im

TL;DR
This paper introduces a novel knowledge distillation approach for full surround monocular depth estimation that improves scale-invariance, cross-view consistency, and real-time performance by transferring robust depth knowledge from foundation models to lightweight networks.
Contribution
It proposes a hybrid regression framework with cross-interaction and view-relational knowledge distillation to enhance scale-invariant and view-consistent depth estimation in real-time.
Findings
Outperforms conventional supervised methods on the DDAD and nuScenes datasets.
Achieves a good balance between accuracy and computational efficiency.
Enables real-time full surround monocular depth estimation with improved scale and view consistency.
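As an illustration of what scale consistency means in practice, the sketch below computes the scale-invariant log error commonly used in monocular depth evaluation (Eigen et al., 2014). This is an assumption about a reasonable metric, not something this summary says the paper reports; the function name and the `lam` parameter are illustrative.

```python
import numpy as np

def scale_invariant_log_error(pred, target, lam=1.0):
    """Scale-invariant log error between two positive depth maps.

    With lam=1.0 the value is unchanged when `pred` is multiplied by any
    positive constant, so it measures relative structure, not metric scale.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    d = np.log(pred) - np.log(target)          # per-pixel log difference
    return float(np.mean(d ** 2) - lam * np.mean(d) ** 2)

# Hypothetical depths in metres, for illustration only.
gt = np.array([2.0, 5.0, 10.0, 40.0])
print(scale_invariant_log_error(3.0 * gt, gt))        # globally rescaled prediction
print(scale_invariant_log_error(gt + 1.0, gt))        # non-uniform error
```

A prediction that is correct up to a global scale factor scores (numerically) zero here, which is why relative-depth foundation models can look strong on this error while still lacking metric scale.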
Abstract
Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme, traditionally used in classification, with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth…
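The depth binning idea mentioned in the abstract can be sketched generically: a network predicts a probability distribution over discrete depth bins per pixel, and the metric depth is the expectation over the bin centers. The code below is a minimal sketch of that classification-plus-regression pattern, not the authors' module; the uniform bin layout, the depth range, and all function names are assumptions made for illustration.

```python
import numpy as np

def bin_centers(d_min, d_max, num_bins):
    """Centers of `num_bins` uniform depth bins on [d_min, d_max] (assumed layout)."""
    edges = np.linspace(d_min, d_max, num_bins + 1)
    return 0.5 * (edges[:-1] + edges[1:])

def softmax(logits, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def depth_from_bins(logits, centers):
    """Expected depth per pixel from per-bin logits.

    logits:  array of shape (..., K), one score per depth bin
    centers: array of shape (K,), bin centers in metres
    """
    probs = softmax(logits, axis=-1)   # classification-style output
    return probs @ centers             # regression as an expectation

# Hypothetical example: 4 bins covering 0 to 80 m, a 2x2 "image".
centers = bin_centers(0.0, 80.0, 4)
logits = np.zeros((2, 2, 4))
logits[0, 0, 0] = 20.0                 # pixel (0, 0) strongly prefers the nearest bin
depth = depth_from_bins(logits, centers)
print(depth.shape)
```

Because the per-bin output is a probability distribution, it is the kind of quantity that classification-style distillation losses can operate on, while the expectation still yields a continuous depth value.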
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
