STAA: Spatio-Temporal Attention Attribution for Real-Time Interpreting Transformer-based Video Models
Zerui Wang, Yan Liu

TL;DR
STAA is a novel explainability method for video Transformer models that provides simultaneous spatial and temporal attributions with low computational cost, enabling real-time analysis.
Contribution
Introduces STAA, an XAI technique that offers combined spatial-temporal explanations from Transformer attention, optimized for real-time video analysis.
Findings
STAA produces more precise visual explanations.
Requires less than 3% of traditional XAI computational resources.
Effective in real-time video Transformer interpretability.
Abstract
Transformer-based models have achieved state-of-the-art performance in various computer vision tasks, including image and video analysis. However, Transformer's complex architecture and black-box nature pose challenges for explainability, a crucial aspect for real-world applications and scientific inquiry. Current Explainable AI (XAI) methods can only provide one-dimensional feature importance, either spatial or temporal explanation, with significant computational complexity. This paper introduces STAA (Spatio-Temporal Attention Attribution), an XAI method for interpreting video Transformer models. Differ from traditional methods that separately apply image XAI techniques for spatial features or segment contribution analysis for temporal aspects, STAA offers both spatial and temporal information simultaneously from attention values in Transformers. The study utilizes the Kinetics-400…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications
MethodsAttention Is All You Need · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dropout · Absolute Position Encodings
