Depth Anything 3: Recovering the Visual Space from Any Views
Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, Bingyi Kang

TL;DR
Depth Anything 3 (DA3) is a versatile model that predicts spatially consistent 3D geometry from an arbitrary number of views using a minimal architecture, achieving state-of-the-art results on visual geometry tasks without specialized components.
Contribution
DA3 introduces a simple transformer-based approach with a single depth-ray prediction target, eliminating complex multi-task learning and achieving high accuracy with minimal modeling.
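To make the depth-ray target concrete, here is a minimal sketch of how a depth map and a ray map can be combined into a 3D point map. It assumes each pixel's ray is stored as an (origin, direction) pair and that the point is origin + depth * direction. The pinhole camera, the identity rotation, and the function names `pinhole_ray_map` and `points_from_depth_and_rays` are hypothetical choices made for illustration, not the authors' implementation.

```python
# Sketch only: combine a depth map and a ray map into a 3D point map.
# Assumption: a ray is (origin, direction) and point = origin + depth * direction.


def pinhole_ray_map(height, width, fx, fy, cx, cy, origin=(0.0, 0.0, 0.0)):
    """Build a per-pixel ray map for a hypothetical pinhole camera.

    The camera is assumed to be axis-aligned with the world frame (identity
    rotation), so pixel (u, v) has direction ((u - cx) / fx, (v - cy) / fy, 1).
    Directions are left unnormalized, so depth here means depth along z.
    """
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            direction = ((u - cx) / fx, (v - cy) / fy, 1.0)
            row.append((tuple(origin), direction))
        rays.append(row)
    return rays


def points_from_depth_and_rays(depth, rays):
    """Return a point map where point[v][u] = origin + depth[v][u] * direction."""
    points = []
    for depth_row, ray_row in zip(depth, rays):
        row = []
        for d, (o, direction) in zip(depth_row, ray_row):
            row.append(tuple(o_i + d * dir_i for o_i, dir_i in zip(o, direction)))
        points.append(row)
    return points


# Toy 2x2 example with made-up intrinsics and depths.
rays = pinhole_ray_map(height=2, width=2, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
depth = [[2.0, 2.0], [4.0, 4.0]]
points = points_from_depth_and_rays(depth, rays)
print(points[0][0])  # (-1.0, -1.0, 2.0)
print(points[1][1])  # (2.0, 2.0, 4.0)
```

Under this assumption the ray map carries the camera information (origin and viewing direction per pixel) while the depth map carries the scene geometry, which is why the pair can stand in for a point map.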
Findings
DA3 surpasses the prior SOTA, VGGT, by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy.
DA3 outperforms DA2 in monocular depth estimation.
The authors establish a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering.
Abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it…
Peer Reviews
Decision: ICLR 2026 Oral
1. This paper presents a thoughtful analysis of which modalities are truly necessary for strong vision understanding tasks. It argues that depth together with a ray map is a minimal and sufficient target set. The ablation in Table 5 convincingly supports this claim by outperforming alternatives. Although the recent work MapAnything also discusses incorporating ray maps into a unified representation, it is contemporaneous work and does not need to be considered here. 2. The Dual DPT head is well
I did not find any major weaknesses. While I know recent advances in this area, I am not fully confident about all technical nuances and distinctions among closely related methods. I am open to perspectives from other reviewers and will continue to track the discussion.
1. The architecture of Depth Anything 3 is simpler than that of previous methods. Depth Anything 3 uses a single vision transformer, while previous methods typically use a vision transformer followed by additional self- and cross-attention layers. Input-adaptive self-attention is used inside the vision transformer to enable cross-view attention without introducing new attention layers. With this simpler structure, Depth Anything 3 is able to process more images, which is meaningful for future research. 2. Extensive and thorough
1. In Table 1 and Table 2, I recommend adding some state-of-the-art methods that are not feed-forward models. This would help readers better understand the performance differences between methods. For example, classical pipelines generally outperform feed-forward models in 3D reconstruction. 2. If the teacher is not used, would the performance degrade noticeably? Currently, I am not sure whether the main improvement comes from the powerful teacher.
- The paper utilized depth and ray map representations to enable full 3D reconstruction from an arbitrary number of input images. - Discovered an effective architecture design that outperforms previous methods while requiring minimal modifications to DINOv2. - The paper demonstrates their model's effectiveness across various experimental settings.
- **Unclear advantage of depth+ray over point map:** To my knowledge, point maps can effectively represent various 3D information such as depth and pose, and a point map is essentially a combination of depth and ray maps. However, Table 5 shows that point maps hurt pose accuracy. What is the reason for this performance degradation? This finding appears to contradict the ablation study in VGGT, which argues that point map accuracy increases with multimodal outputs. I would like to see a more comp
Taxonomy
Topics: Advanced Vision and Imaging · Robot Manipulation and Learning · Robotics and Sensor-Based Localization
