Interpretable Vision Transformers in Monocular Depth Estimation via SVDA

Vasileios Arampatzakis; George Pavlidis; Nikolaos Mitianoudis; Nikos Papamarkos

arXiv:2602.11005 · cs.CV · February 12, 2026

Open Access

TL;DR

This paper introduces SVD-Inspired Attention (SVDA), a spectrally structured attention mechanism for vision transformers in monocular depth estimation. It makes attention maps interpretable while maintaining accuracy, and it yields spectral indicators that describe how attention is organized.

Contribution

The paper presents SVDA, a novel spectrally structured attention formulation that makes transformer attention interpretable in dense prediction tasks like depth estimation.

Findings

01 SVDA preserves or slightly improves accuracy on the KITTI and NYU-v2 datasets.

02 SVDA provides six spectral indicators that reveal how attention is organized.

03 Attention maps become intrinsically interpretable rather than post-hoc approximations, at the cost of only minor computational overhead.

Abstract

Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment,…
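The mechanism described in the abstract, a learnable diagonal matrix embedded into normalized query-key interactions, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `l2_normalize` and `svda_attention`, the placement of the diagonal vector `sigma` between the normalized queries and keys, the absence of any temperature or scaling factor, and the single-head layout are all assumptions made for illustration. The only idea taken from the abstract is that directional alignment (cosine-like similarity of unit-norm queries and keys) is kept separate from spectral modulation (the diagonal weights).

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each vector to (approximately) unit length so that dot products
    # measure direction only, not magnitude.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def svda_attention(q, k, v, sigma):
    """Hypothetical single-head sketch: softmax(Q_hat diag(sigma) K_hat^T) V.

    q, k, v: arrays of shape (n_tokens, d_head)
    sigma:   array of shape (d_head,), standing in for the learnable diagonal
    """
    q_hat = l2_normalize(q)
    k_hat = l2_normalize(k)
    # Multiplying the columns of q_hat by sigma equals q_hat @ diag(sigma).
    scores = (q_hat * sigma) @ k_hat.T
    # Numerically stable row-wise softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_head = 5, 8
q = rng.normal(size=(n_tokens, d_head))
k = rng.normal(size=(n_tokens, d_head))
v = rng.normal(size=(n_tokens, d_head))
sigma = np.ones(d_head)  # in a trained model these values would be learned

out, attn = svda_attention(q, k, v, sigma)
```

In this sketch, setting every entry of `sigma` to zero removes all query-key interaction and yields uniform attention, while larger entries amplify the contribution of the corresponding feature dimension. That makes the diagonal vector directly inspectable, which is the kind of property the abstract attributes to the spectral formulation.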

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Vision and Imaging · Domain Adaptation and Few-Shot Learning