Underwater Monocular Metric Depth Estimation: Real-World Benchmarks and Synthetic Fine-Tuning with Vision Foundation Models
Zijie Cai, Christopher Metzler

TL;DR
This paper benchmarks monocular metric depth estimation foundation models on real-world underwater datasets and improves underwater accuracy by fine-tuning on synthetic underwater data, reducing the domain shift from terrestrial training data.
Contribution
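It provides a comprehensive benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations (including FLSea and SQUID), and fine-tunes Depth Anything V2 on synthetic underwater data to improve performance in challenging underwater conditions.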
Findings
Large-scale models trained on terrestrial data (real or synthetic) are effective in air but perform poorly underwater due to domain shift.
Fine-tuning on synthetic underwater data improves metric depth estimation accuracy underwater.
Domain adaptation is crucial for robust underwater monocular depth prediction.
Abstract
Monocular depth estimation has recently progressed beyond ordinal depth to provide metric depth predictions. However, its reliability in underwater environments remains limited due to light attenuation and scattering, color distortion, turbidity, and the lack of high-quality metric ground truth data. In this paper, we present a comprehensive benchmark of zero-shot and fine-tuned monocular metric depth estimation models on real-world underwater datasets with metric depth annotations, including FLSea and SQUID. We evaluated a diverse set of state-of-the-art Vision Foundation Models across a range of underwater conditions and depth ranges. Our results show that large-scale models trained on terrestrial data (real or synthetic) are effective in in-air settings, but perform poorly underwater due to significant domain shifts. To address this, we fine-tune Depth Anything V2 with a ViT-S…
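As a rough illustration of what benchmarking metric depth predictions against metric ground truth can involve, the sketch below computes three error measures that are widely used in the depth estimation literature: absolute relative error, RMSE, and the fraction of pixels within a 1.25 ratio threshold. This summary does not say which metrics the paper reports, so the metric choice, the function name `depth_metrics`, and the `min_depth`/`max_depth` validity range are all assumptions made for illustration.

```python
import math

def depth_metrics(pred, gt, min_depth=0.1, max_depth=20.0):
    """Compare predicted and ground-truth metric depths (flat lists, in metres).

    Hypothetical helper: the metric set and the validity range are
    illustrative assumptions, not the paper's stated protocol.
    """
    # Keep only pixels whose ground-truth depth lies in the valid range;
    # sparse underwater ground truth often has missing (zero) entries.
    pairs = [(max(p, 1e-6), g) for p, g in zip(pred, gt)
             if min_depth <= g <= max_depth]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)
    # Mean absolute error relative to the true depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Root mean squared error in metres.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of pixels whose prediction is within a factor of 1.25.
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}

# Toy example: the last pixel has no ground truth and is ignored.
metrics = depth_metrics(pred=[1.0, 2.0, 6.0, 3.0], gt=[1.0, 2.0, 4.0, 0.0])
```

Because the predictions are compared directly in metres, with no per-image scale or shift alignment, a model that recovers the scene's relative structure but the wrong absolute scale is still penalized, which is the distinction between metric and ordinal depth that the abstract draws.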
Taxonomy
Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Human Pose and Action Recognition
Methods: Sparse Evolutionary Training
