Transformer Interpretability from Perspective of Attention and Gradient
Yongjin Cui, Xiaohui Fan, Huajun Chen

TL;DR
This paper explores Transformer interpretability through attention and gradient analysis, proposing a method that guides gradient and attention directions to enhance feature interpretation and understanding of Transformer mechanisms.
Contribution
It introduces a novel approach to Transformer interpretation by guiding gradients and attention, and reveals security concerns via class rewriting in Vision Transformers.
Findings
Guided gradient and attention improve feature region interpretation.
The method offers detailed insights into Transformer mechanisms.
Class rewriting in Vision Transformers can be almost imperceptible to humans.
Abstract
Although researchers tend to focus on the performance of Transformer models, the interpretation of Transformers cannot be ignored. Gradients are widely used in Transformer interpretation. From the perspective of attention and gradient, we conduct an in-depth study of Transformer interpretation and propose a method that achieves it by guiding the gradient direction or, more precisely, the attention direction. The method enables more comprehensive interpretation of feature regions, offers detailed interpretation, and helps to better understand the Transformer mechanism. Leveraging the difference in how a Vision Transformer (ViT) and humans perceive images, we alter the class of an image in a way that is almost imperceptible to the human eye. This class rewriting phenomenon may pose security risks in certain scenarios.
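The abstract does not spell out how gradients and attention are combined, so the sketch below is not the authors' method. It only illustrates the general idea behind gradient-based attention interpretation: weight each attention value by the gradient of a class logit with respect to it, and keep the positive part. The toy single-head model, the function name `gradient_weighted_attention`, and all weight matrices are hypothetical and randomly initialized.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gradient_weighted_attention(X, Wq, Wk, Wv, Wc, target):
    """Toy single-head attention with a linear classifier on token 0.

    Hypothetical illustration only; not the paper's algorithm.
    Returns the attention row of token 0, the gradient of the target
    logit with respect to that row, their rectified product, and the logits.
    """
    d = Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))      # (n, n) attention matrix
    logits = (A[0] @ V) @ Wc               # classify from token 0's output
    # logit_target = sum_j A[0, j] * (V[j] @ Wc[:, target]), so the
    # gradient with respect to A[0, j] is V[j] @ Wc[:, target].
    grad = V @ Wc[:, target]
    relevance = np.maximum(A[0] * grad, 0.0)  # keep positive contributions
    return A[0], grad, relevance, logits

rng = np.random.default_rng(0)
n_tokens, dim, n_classes = 5, 8, 3         # assumed toy sizes
X = rng.normal(size=(n_tokens, dim))
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
Wc = rng.normal(size=(dim, n_classes))
target = 1                                 # class whose evidence we inspect

attn, grad, relevance, logits = gradient_weighted_attention(
    X, Wq, Wk, Wv, Wc, target
)
print("relevance per token:", relevance)
```

Tokens with high relevance are those that token 0 attends to and whose values push the target logit upward. In a real ViT the gradient would come from automatic differentiation through every layer rather than from this closed form.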
