Understanding Video Transformers via Universal Concept Discovery
Matthew Kowal, Achal Dave, Rares Ambrus, Adrien Gaidon, Konstantinos G. Derpanis, Pavel Tokmakov

TL;DR
This paper introduces VTCD, an unsupervised method to discover and interpret high-level spatiotemporal concepts in video transformers, enhancing understanding of their decision processes and enabling applications like action recognition and segmentation.
Contribution
It presents the first algorithm for concept discovery in video transformers, addressing temporal complexity and revealing universal interpretability mechanisms.
Findings
Discovered interpretable spatiotemporal concepts in video transformers.
Identified universal reasoning mechanisms across different models.
Enabled fine-grained action recognition and object segmentation.
Abstract
This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. Comparatively, video models deal with the added temporal dimension, increasing complexity and posing challenges in identifying dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for unsupervised identification of units of video transformer representations - concepts, and ranking their importance to the output of a model. The resulting concepts are highly interpretable,…
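The two steps named in the abstract, unsupervised concept identification and importance ranking, can be illustrated with a toy sketch. This is not the VTCD algorithm, whose details are not given in the text above. It assumes that a "concept" is a k-means cluster of token features and that "importance" is the change in a scalar readout when a cluster's tokens are zeroed out. The synthetic features, the `readout` function, and every name in the block are hypothetical.

```python
import numpy as np

def farthest_point_init(features, k):
    """Deterministic init: start at token 0, then repeatedly add the farthest token."""
    chosen = [0]
    for _ in range(k - 1):
        diffs = features[:, None, :] - features[chosen][None, :, :]
        nearest = np.linalg.norm(diffs, axis=-1).min(axis=1)
        chosen.append(int(nearest.argmax()))
    return features[chosen].copy()

def kmeans(features, k, iters=10):
    """Plain k-means over token features; returns one cluster label per token."""
    centroids = farthest_point_init(features, k)
    for _ in range(iters):
        diffs = features[:, None, :] - centroids[None, :, :]
        labels = np.linalg.norm(diffs, axis=-1).argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return labels

def rank_concepts(features, labels, k, readout):
    """Score each cluster by how much zeroing its tokens changes the readout."""
    base = readout(features)
    drops = np.array([
        abs(base - readout(np.where((labels == c)[:, None], 0.0, features)))
        for c in range(k)
    ])
    # Most important concept first.
    return np.argsort(drops)[::-1], drops

# Synthetic stand-in for flattened space-time tokens: 3 groups of 12 tokens,
# each group centred on a different axis of an 8-dimensional feature space.
rng = np.random.default_rng(0)
num_groups, tokens_per_group, dim = 3, 12, 8
group_ids = np.repeat(np.arange(num_groups), tokens_per_group)
features = 5.0 * np.eye(dim)[group_ids] + 0.1 * rng.normal(size=(group_ids.size, dim))

def readout(tokens):
    """Hypothetical model output: mean-pooled tokens projected onto axis 1."""
    return float(tokens.mean(axis=0)[1])

labels = kmeans(features, k=num_groups)
order, drops = rank_concepts(features, labels, num_groups, readout)
print("concepts ranked by importance:", order, "readout change:", drops.round(3))
```

Because the readout in this toy setup depends only on feature axis 1, the cluster built around that axis should rank first. With a real video transformer, the features would come from an intermediate layer and the readout from the model's own prediction.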
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
Methods: Sparse Evolutionary Training · Attention Is All You Need · Absolute Position Encodings · Label Smoothing · Layer Normalization · Adam · Residual Connection · Dropout · Linear Layer · Multi-Head Attention
