StableDPT: Temporal Stable Monocular Video Depth Estimation

Ivan Sobko; Hayko Riemenschneider; Markus Gross; Christopher Schroers

arXiv:2601.02793·cs.CV·January 7, 2026

TL;DR

StableDPT adapts existing image-based monocular depth estimation models to video by adding a trainable temporal module, improving the temporal stability of the predictions and the processing speed.

Contribution

The paper introduces trainable temporal layers in the DPT head of a Vision Transformer-based architecture. These layers use cross-attention over keyframes sampled across the video, giving temporally stable depth estimates that take global context into account.

Findings

01. Improves temporal stability and consistency in depth predictions.

02. Achieves state-of-the-art performance on benchmark datasets.

03. Processes videos twice as fast as previous methods.

Abstract

Applying single-image Monocular Depth Estimation (MDE) models to video sequences introduces significant temporal instability and flickering artifacts. We propose a novel approach that adapts any state-of-the-art image-based depth estimation model for video processing by integrating a new temporal module that is trainable on a single GPU in a few days. Our architecture, StableDPT, builds upon an off-the-shelf Vision Transformer (ViT) encoder and enhances the Dense Prediction Transformer (DPT) head. The core of our contribution lies in the temporal layers within the head, which use an efficient cross-attention mechanism to integrate information from keyframes sampled across the entire video sequence. This allows the model to capture global context and inter-frame relationships, leading to more accurate and temporally stable depth predictions. Furthermore, we propose a novel inference strategy…
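The keyframe cross-attention idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the uniform keyframe sampling, the single attention head, the residual connection, the random projection matrices, and all names (`sample_keyframes`, `temporal_cross_attention`, `w_q`, `w_k`, `w_v`) are assumptions made for illustration only.

```python
import numpy as np


def sample_keyframes(num_frames: int, num_keyframes: int) -> np.ndarray:
    """Pick keyframe indices spread evenly over the whole sequence.

    Uniform spacing is an assumption; the source only says keyframes are
    sampled across the entire video.
    """
    num_keyframes = min(num_keyframes, num_frames)
    return np.linspace(0, num_frames - 1, num_keyframes).round().astype(int)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_cross_attention(frame_tokens, key_tokens, w_q, w_k, w_v):
    """Single-head cross-attention from one frame to a set of keyframes.

    frame_tokens: (N, D) tokens of the frame being processed (queries)
    key_tokens:   (K, N, D) tokens of K keyframes (keys and values)
    Returns the fused tokens (N, D) and the attention weights (N, K * N).
    """
    d = frame_tokens.shape[-1]
    context = key_tokens.reshape(-1, d)           # flatten keyframes: (K * N, D)
    q = frame_tokens @ w_q                        # queries from the current frame
    k = context @ w_k                             # keys from the keyframes
    v = context @ w_v                             # values from the keyframes
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    fused = frame_tokens + weights @ v            # residual connection (assumed)
    return fused, weights


# Toy data: T frames, N tokens per frame, D channels per token.
rng = np.random.default_rng(0)
T, N, D = 30, 16, 8
video_tokens = rng.standard_normal((T, N, D))

key_idx = sample_keyframes(T, 4)
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

fused, weights = temporal_cross_attention(
    video_tokens[5], video_tokens[key_idx], w_q, w_k, w_v
)
print(key_idx, fused.shape, weights.shape)
```

In this sketch the attention cost per frame grows with the number of keyframes rather than with the video length, which is one plausible reading of why attending to a small keyframe set is described as efficient.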

Taxonomy

Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Advanced Image Processing Techniques