Visual Autoregressive Modelling for Monocular Depth Estimation
Amir El-Ghoussani, André Kaup, Nassir Navab, Gustavo Carneiro, Vasileios Belagiannis

TL;DR
This paper introduces a novel monocular depth estimation method using visual autoregressive priors, leveraging large-scale text-to-image models and a scale-wise upsampling mechanism, achieving state-of-the-art results with limited training data.
Contribution
It presents an autoregressive approach to depth estimation that reaches state-of-the-art results on indoor benchmarks under constrained training conditions, and it shows that autoregressive priors adapt to 3D vision tasks.
Findings
State-of-the-art indoor benchmark performance
Strong outdoor dataset results
Requires only 74K synthetic samples for fine-tuning
Abstract
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requires only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code is available at https://github.com/AmirMaEl/VAR-Depth.
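The abstract names two mechanisms, classifier-free guidance and a fixed ten-stage coarse-to-fine schedule, without giving their formulas. The following is a minimal sketch of how the two could fit together, not the authors' implementation. It uses the standard classifier-free guidance rule (extrapolating from unconditional to conditional logits). The names `cfg_combine`, `toy_model`, `generate`, the side lengths in `SCALES`, the vocabulary size, and the guidance scale are all assumptions made for illustration, and `toy_model` is a random stand-in for the fine-tuned VAR network.

```python
import numpy as np

# Hypothetical token-map side lengths for ten coarse-to-fine stages.
# The abstract only states that there are ten stages; these values are assumed.
SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]


def cfg_combine(cond_logits, uncond_logits, guidance_scale):
    """Standard classifier-free guidance on logits.

    guidance_scale = 0 returns the unconditional logits,
    guidance_scale = 1 returns the conditional logits,
    values above 1 extrapolate past the conditional prediction.
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)


def toy_model(side, vocab, conditioned, rng):
    """Random stand-in for the network (assumption, not the real model).

    Returns logits of shape (side, side, vocab); the conditioned branch
    gets a small bias so the two branches differ.
    """
    logits = rng.normal(size=(side, side, vocab))
    if conditioned:
        logits[..., 0] += 1.0
    return logits


def generate(guidance_scale=2.0, vocab=8, seed=0):
    """Run one guided prediction per stage and collect the token maps."""
    rng = np.random.default_rng(seed)
    token_maps = []
    for side in SCALES:
        cond = toy_model(side, vocab, True, rng)     # image-conditioned pass
        uncond = toy_model(side, vocab, False, rng)  # condition dropped
        logits = cfg_combine(cond, uncond, guidance_scale)
        token_maps.append(logits.argmax(axis=-1))    # greedy token choice
    return token_maps


if __name__ == "__main__":
    maps = generate()
    print([m.shape for m in maps])
```

In a real model each stage would also receive the upsampled token maps from the earlier stages as input; the sketch omits that dependency to keep the guidance step and the stage loop visible.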
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Video Coding and Compression Technologies
