Understanding Transformer-based Vision Models through Inversion

Jan Rathjens; Shirin Reyhanian; David Kappel; Laurenz Wiskott

arXiv:2412.06534 · cs.CV · August 15, 2025

TL;DR

This paper introduces an efficient feature inversion method to interpret transformer-based vision models, revealing how they encode image features, shape, and details, thereby deepening understanding of their internal representations.

Contribution

The study presents a novel, modular feature inversion technique applied to large-scale transformer vision models, enabling qualitative and quantitative analysis of their internal representations.

Findings

01 Models encode contextual shape and image details.
02 Layer correlations reveal internal structure.
03 Robustness against color perturbations is analyzed.

Abstract

Understanding the mechanisms underlying deep neural networks remains a fundamental challenge in machine learning and computer vision. One promising, yet only preliminarily explored, approach is feature inversion, which attempts to reconstruct images from intermediate representations using trained inverse neural networks. In this study, we revisit feature inversion, introducing a novel, modular variation that enables significantly more efficient application of the technique. We demonstrate how our method can be systematically applied to two large-scale transformer-based vision models, Detection Transformer and Vision Transformer, and how the reconstructed images can be qualitatively interpreted in a meaningful way. We further evaluate our method quantitatively, thereby uncovering the mechanisms for representing image features that emerge in the two transformer architectures. Our…
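To make the idea of feature inversion concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation. It assumes one plausible reading of "modular": a frozen forward model is split into stages, a separate inverse is fitted for each stage, and the inverses are chained to map a deep representation back to the input. The two-stage `tanh` model, the linear ridge-regression inverses, and all names (`stage1`, `stage2`, `fit_inverse`) are hypothetical stand-ins; the paper uses trained inverse neural networks on real images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a frozen model split into two stages (assumption:
# the real models are DETR / ViT; these random layers only illustrate the idea).
W1 = rng.normal(size=(8, 16)) * 0.3 / np.sqrt(8)
W2 = rng.normal(size=(16, 32)) / np.sqrt(16)


def stage1(x):
    """First forward stage: input -> intermediate representation."""
    return np.tanh(x @ W1)


def stage2(h):
    """Second forward stage: intermediate -> deeper representation."""
    return np.tanh(h @ W2)


def fit_inverse(outputs, inputs, ridge=0.1):
    """Fit a linear inverse (with bias) mapping a stage's outputs back to
    its inputs via ridge regression. A stand-in for a trained inverse net."""
    a = np.hstack([outputs, np.ones((len(outputs), 1))])
    gram = a.T @ a + ridge * np.eye(a.shape[1])
    weights = np.linalg.solve(gram, a.T @ inputs)
    return lambda y: np.hstack([y, np.ones((len(y), 1))]) @ weights


# Collect activations of the frozen forward model on "training images".
x_train = rng.normal(size=(2000, 8))
h_train = stage1(x_train)
z_train = stage2(h_train)

# Fit each inverse module independently, one per forward stage.
inv1 = fit_inverse(h_train, x_train)  # intermediate -> input
inv2 = fit_inverse(z_train, h_train)  # deep -> intermediate

# Chain the inverse modules to reconstruct held-out inputs from deep features.
x_test = rng.normal(size=(500, 8))
x_hat = inv1(inv2(stage2(stage1(x_test))))

mse = np.mean((x_hat - x_test) ** 2)
baseline = np.mean((x_test - x_train.mean(axis=0)) ** 2)  # predict-the-mean
print(f"reconstruction MSE: {mse:.4f}, mean-baseline MSE: {baseline:.4f}")
```

Because each inverse is fitted against only its own stage, an inverse for a shallower layer can be reused when inverting any deeper layer. Under the assumption stated above, that reuse is one way a modular scheme could be cheaper than training a full inverse network per layer.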

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Advanced Vision and Imaging

Methods: Attention Is All You Need · Vision Transformer · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention