Visual Autoregressive Modelling for Monocular Depth Estimation

Amir El-Ghoussani; André Kaup; Nassir Navab; Gustavo Carneiro; Vasileios Belagiannis

arXiv:2512.22653·cs.CV·December 30, 2025

Open Access

TL;DR

This paper introduces a monocular depth estimation method built on visual autoregressive (VAR) priors. It adapts a large-scale text-to-image VAR model with a scale-wise conditional upsampling mechanism and reports state-of-the-art results on indoor benchmarks under constrained training conditions, using only 74K synthetic samples for fine-tuning.

Contribution

It presents an autoregressive approach to depth estimation that reports state-of-the-art performance on indoor benchmarks under constrained training conditions and strong performance on outdoor datasets. It also positions autoregressive priors as a complementary family of geometry-aware generative models for 3D vision tasks.

Findings

01. State-of-the-art indoor benchmark performance

02. Strong outdoor dataset results

03. Requires only 74K synthetic samples for fine-tuning

Abstract

We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Our approach performs inference in ten fixed autoregressive stages, requires only 74K synthetic samples for fine-tuning, and achieves competitive results. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance when applied to outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code is available at https://github.com/AmirMaEl/VAR-Depth.
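The abstract names two mechanisms: a fixed ten-stage coarse-to-fine schedule and classifier-free guidance. The sketch below illustrates how they might combine in principle. It is a toy and not the authors' implementation. The function names, the scale sizes in `SCALES`, and `toy_predictor` (a stand-in for the fine-tuned VAR transformer) are all hypothetical. It also operates on continuous maps, whereas VAR models predict discrete tokens. Only the guidance rule itself, `uncond + w * (cond - uncond)`, is the standard classifier-free guidance formula.

```python
import numpy as np

def cfg_combine(cond, uncond, guidance_scale):
    """Standard classifier-free guidance rule: uncond + w * (cond - uncond)."""
    return uncond + guidance_scale * (cond - uncond)

def upsample_nearest(x, size):
    """Nearest-neighbour resize of a square 2D map to (size, size)."""
    idx = np.arange(size) * x.shape[0] // size
    return x[np.ix_(idx, idx)]

def toy_predictor(canvas, image, size, conditional):
    """Hypothetical stand-in for the network: predicts a residual map.

    The conditional branch sees the input image; the unconditional
    branch sees nothing and simply pulls the canvas toward zero.
    """
    target = upsample_nearest(image, size) if conditional else np.zeros((size, size))
    return target - canvas

def coarse_to_fine(image, scales, guidance_scale=1.0):
    """Run one guided prediction per scale, upsampling the canvas between stages."""
    canvas = np.zeros((scales[0], scales[0]))
    for size in scales:
        canvas = upsample_nearest(canvas, size)           # carry coarse estimate up
        cond = toy_predictor(canvas, image, size, True)   # image-conditioned branch
        uncond = toy_predictor(canvas, image, size, False)
        canvas = canvas + cfg_combine(cond, uncond, guidance_scale)
    return canvas

# Ten stages, matching the count in the abstract; the sizes are assumptions.
SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]
image = np.random.default_rng(0).random((16, 16))
depth = coarse_to_fine(image, SCALES, guidance_scale=1.0)
```

With `guidance_scale=1.0` the rule reduces to the conditional prediction alone. Values above 1 extrapolate away from the unconditional branch, which is the usual way the guidance strength is tuned.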

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Video Coding and Compression Technologies