GeoSurDepth: Harnessing Foundation Model for Spatial Geometry Consistency-Oriented Self-Supervised Surround-View Depth Estimation

Weimin Liu; Wenjun Wang; Joshua H. Meng

arXiv:2601.05839 · cs.CV · January 21, 2026

Open Access

TL;DR

GeoSurDepth introduces a self-supervised surround-view depth estimation framework that leverages geometric consistency and foundation models to improve 3D scene understanding in autonomous driving.

Contribution

GeoSurDepth explicitly exploits geometric structure for surround-view depth estimation, using vision foundation models as pseudo geometry priors together with a novel view synthesis pipeline.

Findings

01. Achieves state-of-the-art performance on KITTI, DDAD, and nuScenes datasets.

02. Effectively maintains surface normal consistency in 3D space.

03. Enhances depth estimation robustness through geometry-aware supervision.

Abstract

Accurate surround-view depth estimation provides a competitive alternative to laser-based sensors and is essential for 3D scene understanding in autonomous driving. While empirical studies have proposed various approaches that primarily focus on enforcing cross-view constraints at the photometric level, few explicitly exploit the rich geometric structure inherent in both monocular and surround-view settings. In this work, we propose GeoSurDepth, a framework that leverages geometric consistency as the primary cue for surround-view depth estimation. Concretely, we use vision foundation models as pseudo geometry priors and as a feature representation enhancement tool, guiding the network to maintain surface normal consistency in 3D space and to regularize object- and texture-consistent depth estimation in 2D. In addition, we introduce a novel view synthesis pipeline where 2D-3D lifting is…
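To make the idea of surface normal consistency concrete, the sketch below lifts a depth map to 3D with a pinhole camera model, estimates per-pixel normals from neighboring 3D points, and scores agreement with a reference normal map by cosine distance. It is an illustration only, not the authors' implementation: the function names, the forward-difference normal estimate, the loss form, and the intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions, and the normals here point along +z (away from the camera) by convention.

```python
import math

def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows) to camera-frame 3D points (pinhole model)."""
    return [[((u - cx) * d / fx, (v - cy) * d / fy, d) for u, d in enumerate(row)]
            for v, row in enumerate(depth)]

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(v):
    n = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    return (v[0] / n, v[1] / n, v[2] / n) if n > 0 else (0.0, 0.0, 0.0)

def normals_from_depth(depth, fx, fy, cx, cy):
    """Estimate normals as the cross product of right and down 3D differences.

    Hypothetical helper: the output is one row and one column smaller than the input.
    """
    pts = backproject(depth, fx, fy, cx, cy)
    h, w = len(pts), len(pts[0])
    return [[_unit(_cross(_sub(pts[v][u + 1], pts[v][u]),
                          _sub(pts[v + 1][u], pts[v][u])))
             for u in range(w - 1)]
            for v in range(h - 1)]

def normal_consistency_loss(pred, ref):
    """Mean cosine distance (1 - dot product) between two unit-normal maps."""
    total, count = 0.0, 0
    for row_p, row_r in zip(pred, ref):
        for p, r in zip(row_p, row_r):
            total += 1.0 - (p[0] * r[0] + p[1] * r[1] + p[2] * r[2])
            count += 1
    return total / count if count else 0.0

# A fronto-parallel plane at constant depth: every normal lies along the optical axis.
flat = [[2.0] * 4 for _ in range(4)]
normals = normals_from_depth(flat, fx=100.0, fy=100.0, cx=1.5, cy=1.5)
loss = normal_consistency_loss(normals, normals)
```

In a training setup of the kind the abstract describes, the reference map would presumably come from a vision foundation model's pseudo normals rather than from the prediction itself; that pairing is assumed here, since the abstract is truncated before the details.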

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Robotics and Sensor-Based Localization