Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning with Vision Foundation Models

Zijie Cai; Christopher Metzler

arXiv:2507.02148·cs.CV·July 11, 2025

TL;DR

This paper benchmarks and improves monocular metric depth estimation in underwater environments by evaluating foundation models and fine-tuning on synthetic underwater data, addressing domain shift challenges.

Contribution

The paper benchmarks monocular metric depth estimation models on real-world underwater data and introduces a synthetic fine-tuning approach to improve their performance in challenging underwater conditions.

Findings

01  Large terrestrial models perform poorly underwater due to domain shift.

02  Fine-tuning on synthetic underwater data improves depth estimation accuracy.

03  Domain adaptation is crucial for robust underwater monocular depth prediction.

Abstract

Monocular depth estimation has recently progressed beyond ordinal depth to provide metric depth predictions. However, its reliability in underwater environments remains limited due to light attenuation and scattering, color distortion, turbidity, and the lack of high-quality metric ground truth data. In this paper, we present a comprehensive benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations, including FLSea and SQUID. We evaluate a diverse set of state-of-the-art Vision Foundation Models across a range of underwater conditions and depth ranges. Our results show that large-scale models trained on terrestrial data (real or synthetic) are effective in in-air settings but perform poorly underwater due to significant domain shifts. To address this, we fine-tune Depth Anything V2 with a ViT-S…
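To make "benchmarking metric depth" concrete, the sketch below scores a predicted depth map against metric ground truth using three error measures that are common in the depth estimation literature: absolute relative error, RMSE, and the fraction of pixels with ratio error below 1.25. The abstract as shown here does not name the metrics the paper reports, so the choice of metrics, the function name `depth_metrics`, and the `min_depth`/`max_depth` validity range are all assumptions made for illustration, not the authors' evaluation protocol.

```python
import math


def depth_metrics(pred, gt, min_depth=0.1, max_depth=20.0):
    """Score metric depth predictions against metric ground truth.

    pred, gt: flat sequences of depths in metres (same length).
    min_depth, max_depth: hypothetical validity range; pixels whose
    ground truth falls outside it (e.g. missing depth stored as 0)
    are ignored, as are non-positive predictions.
    """
    pairs = [
        (p, g)
        for p, g in zip(pred, gt)
        if min_depth <= g <= max_depth and p > 0
    ]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")

    n = len(pairs)
    # Mean absolute error relative to the true depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Root mean squared error in metres.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Share of pixels whose prediction is within a factor of 1.25 of truth.
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n

    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


if __name__ == "__main__":
    # Third pixel has no ground truth (0.0) and is skipped.
    print(depth_metrics([1.0, 2.0, 4.0], [1.0, 2.5, 0.0]))
```

No scale or shift alignment is applied before scoring, which is what separates a metric evaluation from a relative-depth one: a model that recovers the scene shape but at the wrong absolute scale is penalized on all three measures.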

Taxonomy

Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Human Pose and Action Recognition

Methods: Sparse Evolutionary Training