Interpretable Vision Transformers in Monocular Depth Estimation via SVDA
Vasileios Arampatzakis, George Pavlidis, Nikolaos Mitianoudis, Nikos Papamarkos

TL;DR
This paper introduces SVDA, a spectrally structured attention mechanism for vision transformers in monocular depth estimation, enhancing interpretability while maintaining accuracy and providing new insights into attention organization.
Contribution
The paper presents SVDA, a novel spectrally structured attention formulation that makes transformer attention interpretable in dense prediction tasks like depth estimation.
Findings
SVDA preserves or slightly improves predictive accuracy on the KITTI and NYU-v2 datasets.
SVDA provides six spectral indicators that quantify properties of attention such as entropy, rank, sparsity, and alignment.
Attention maps become intrinsically interpretable, rather than post-hoc approximations, at only minor computational overhead.
Abstract
Monocular depth estimation is a central problem in computer vision with applications in robotics, AR, and autonomous driving, yet the self-attention mechanisms that drive modern Transformer architectures remain opaque. We introduce SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), providing the first spectrally structured formulation of attention for dense prediction tasks. SVDA decouples directional alignment from spectral modulation by embedding a learnable diagonal matrix into normalized query-key interactions, enabling attention maps that are intrinsically interpretable rather than post-hoc approximations. Experiments on KITTI and NYU-v2 show that SVDA preserves or slightly improves predictive accuracy while adding only minor computational overhead. More importantly, SVDA unlocks six spectral indicators that quantify entropy, rank, sparsity, alignment,…
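As a rough illustration of the idea in the abstract, the sketch below computes attention from L2-normalized queries and keys whose interaction is modulated by a diagonal matrix. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function names (`l2_normalize`, `svda_like_attention`), the single-head setting, the absence of a temperature or 1/sqrt(d) scaling, and the use of a fixed vector `sigma` in place of a learned parameter are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so that dot products measure direction only.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def svda_like_attention(Q, K, V, sigma):
    """Hypothetical single-head sketch: softmax(Qn diag(sigma) Kn^T) V.

    Q, K, V: arrays of shape (n_tokens, d).
    sigma:   array of shape (d,), standing in for the learnable diagonal.
    """
    Qn = l2_normalize(Q)  # directional part of the queries
    Kn = l2_normalize(K)  # directional part of the keys
    # Broadcasting Qn * sigma is equivalent to Qn @ np.diag(sigma).
    scores = (Qn * sigma) @ Kn.T
    # Numerically stable row-wise softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy usage with random inputs (assumed shapes, not taken from the paper).
rng = np.random.default_rng(0)
n_tokens, d = 5, 4
Q = rng.normal(size=(n_tokens, d))
K = rng.normal(size=(n_tokens, d))
V = rng.normal(size=(n_tokens, d))
sigma = np.ones(d)  # a trained model would learn these values

out, attn = svda_like_attention(Q, K, V, sigma)
```

In this sketch each entry of `sigma` weights one feature dimension of the normalized query-key product, so inspecting `sigma` shows which dimensions contribute to the attention scores. Setting every entry to zero yields uniform attention over all tokens.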
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Vision and Imaging · Domain Adaptation and Few-Shot Learning
