MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Baorui Ma; Jiahui Yang; Donglin Di; Xuancheng Zhang; Jianxun Cui; Hao Li; Yan Xie; Wei Chen

arXiv:2601.22054·cs.CV·January 30, 2026

MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources

Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui, Hao Li, Yan Xie, Wei Chen

PDF

Open Access 2 Models

TL;DR

MetricAnything introduces a scalable pretraining framework for metric depth estimation from noisy, heterogeneous 3D data, enabling improved performance across various depth and spatial perception tasks without task-specific engineering.

Contribution

The paper presents a novel universal pretraining approach using Sparse Metric Prompts that effectively learns from diverse noisy 3D sources, demonstrating clear scaling trends and state-of-the-art results.

Findings

01

Achieved state-of-the-art monocular depth estimation results.

02

Demonstrated effective cross-source 3D data utilization.

03

Enhanced multimodal spatial intelligence with pretrained ViT.

Abstract

Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10000 camera models, we demonstrate-for the first time-a clear scaling trend in the metric depth track. The pretrained model excels…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Models

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdvanced Vision and Imaging · Robotics and Sensor-Based Localization · Robot Manipulation and Learning