Visualizing and Understanding Patch Interactions in Vision Transformer

Jie Ma; Yalong Bai; Bineng Zhong; Wei Zhang; Ting Yao; Tao Mei

arXiv:2203.05922 · cs.CV · March 14, 2022

TL;DR

This paper introduces a visualization method to analyze patch interactions in Vision Transformers, quantifies their impact, and designs a window-free architecture that improves accuracy and generalization across tasks.

Contribution

The paper presents a novel explainability approach for ViT, together with a quantification indicator for patch interactions and a window-free architecture, improving both understanding and performance.

Findings

01. The quantification indicator effectively measures the impact of patch interactions.

02. The window-free transformer architecture improves top-1 accuracy by up to 4.28%.

03. The method generalizes well to downstream fine-grained recognition tasks.

Abstract

Vision Transformer (ViT) has become a leading tool in various computer vision tasks, owing to its unique self-attention mechanism that learns visual representations explicitly through cross-patch information interactions. Despite this success, the literature seldom explores the explainability of vision transformers, and there is no clear picture of how the attention mechanism's handling of correlations across patches affects performance or what further potential it holds. In this work, we propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in vision transformers. Specifically, we first introduce a quantification indicator to measure the impact of patch interaction and verify this quantification on attention window design and the removal of indiscriminative patches. Then, we exploit the…
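For readers unfamiliar with the cross-patch interactions the abstract refers to, the following is a minimal sketch of standard single-head scaled dot-product attention over patch embeddings. It is not the paper's quantification indicator or visualization method; the function name `patch_attention`, the random inputs, and all sizes are assumptions chosen for illustration only.

```python
import numpy as np

def patch_attention(x, w_q, w_k):
    """Patch-to-patch attention weights for one head (illustrative helper).

    x:   (num_patches, dim) patch embeddings
    w_q: (dim, head_dim) query projection
    w_k: (dim, head_dim) key projection
    Returns a (num_patches, num_patches) matrix whose rows sum to 1.
    """
    q = x @ w_q
    k = x @ w_k
    # Scaled dot-product scores between every pair of patches.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Hypothetical sizes and random data, not taken from the paper.
rng = np.random.default_rng(0)
num_patches, dim, head_dim = 16, 32, 8
x = rng.normal(size=(num_patches, dim))
w_q = rng.normal(size=(dim, head_dim))
w_k = rng.normal(size=(dim, head_dim))

attn = patch_attention(x, w_q, w_k)
```

Row i of `attn` shows how strongly patch i attends to every other patch. A matrix of this kind is the raw material that attention-based analyses of patch interaction typically start from.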

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Dense Connections · Residual Connection · Layer Normalization · Absolute Position Encodings · Adam · Label Smoothing