ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline
Juan Manuel Hernandez, Mariana Fernandez-Espinosa, Denis Parra, Diego Gomez-Zara

TL;DR
ViT-Explainer is an interactive visualization tool that helps users understand the entire inference process of Vision Transformers through animations, attention overlays, and guided exploration.
Contribution
The paper introduces a web-based system for end-to-end interpretability of Vision Transformer inference, combining animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens in guided and free exploration modes.
Findings
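Participants in a six-person user study found ViT-Explainer easy to learn and use.
The system visualizes the full Vision Transformer pipeline, from patch tokenization to final classification.
The user study suggests that the tool makes Vision Transformers easier to interpret, though the sample is small.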
Abstract
Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use,…
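To make two of the stages named in the abstract concrete, the following is a minimal NumPy sketch of patch tokenization and a Logit Lens style readout. It is not the authors' implementation: the function names `patchify` and `logit_lens`, the array shapes, and the choice to read out from the first ([CLS]) token are assumptions made for illustration. A real ViT would also apply a learned linear projection and position embeddings to the patches, and a Logit Lens typically applies the final layer norm before the classification head. Both are omitted here.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patch tokens.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image grid.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # Separate the grid axes from the within-patch axes.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    # Reorder to (grid_row, grid_col, patch_row, patch_col, channel).
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def logit_lens(hidden_states, head_weight, head_bias):
    """Project each layer's [CLS] state through the classification head.

    hidden_states: (num_layers, num_tokens, dim), token 0 assumed to be [CLS].
    head_weight:   (dim, num_classes); head_bias: (num_classes,).
    Returns per-layer class probabilities of shape (num_layers, num_classes).
    """
    cls_states = hidden_states[:, 0, :]
    logits = cls_states @ head_weight + head_bias
    # Numerically stable softmax over the class axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)
```

With a 224x224 RGB image and 16x16 patches, `patchify` yields 196 tokens of 768 values each. Reading `logit_lens` row by row shows how the class distribution changes across layers, which is the kind of intermediate prediction a Logit Lens view is meant to expose.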
