Understanding Video Transformers via Universal Concept Discovery

Matthew Kowal; Achal Dave; Rares Ambrus; Adrien Gaidon; Konstantinos G. Derpanis; Pavel Tokmakov

arXiv:2401.10831 · cs.CV · April 11, 2024 · 1 citation

Open Access

TL;DR

This paper introduces VTCD, an unsupervised method to discover and interpret high-level spatiotemporal concepts in video transformers, enhancing understanding of their decision processes and enabling applications like action recognition and segmentation.

Contribution

The paper presents the first concept discovery algorithm for video transformers. It addresses the added complexity of the temporal dimension and reveals reasoning mechanisms that are shared across different models.

Findings

01. Discovered interpretable spatiotemporal concepts in video transformers.

02. Identified universal reasoning mechanisms across different models.

03. Enabled fine-grained action recognition and object segmentation.

Abstract

This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely, we seek to explain the decision-making process of video transformers based on high-level, spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. Comparatively, video models deal with the added temporal dimension, increasing complexity and posing challenges in identifying dynamic concepts over time. In this work, we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end, we propose an efficient approach for unsupervised identification of units of video transformer representations (concepts) and for ranking their importance to the output of a model. The resulting concepts are highly interpretable,…
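The two steps named in the abstract, unsupervised discovery of concepts and ranking their importance to the model output, can be illustrated with a toy sketch. This is not the VTCD algorithm itself, whose details are not given on this page. It assumes that token features have already been extracted from some layer, substitutes plain k-means for the discovery step, and ranks concepts by how much a hypothetical linear scoring head drops when a concept's tokens are zeroed out. All names (`kmeans_labels`, `toy_score`, `rank_concepts`) and the toy data are made up for illustration.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(features: Sequence[Vector], k: int,
                  iters: int = 20, seed: int = 0) -> List[int]:
    """Assign each token feature to one of k clusters ("concepts")."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input features.
    centroids = [list(c) for c in rng.sample(list(features), k)]
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: nearest centroid for every feature.
        labels = [
            min(range(k), key=lambda j: squared_distance(f, centroids[j]))
            for f in features
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [f for f, lab in zip(features, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def toy_score(features: Sequence[Vector], weights: Vector) -> float:
    """Hypothetical stand-in for a model output: a linear head summed over tokens."""
    return sum(sum(w * x for w, x in zip(weights, f)) for f in features)


def rank_concepts(features: Sequence[Vector], labels: Sequence[int],
                  weights: Vector, k: int) -> List[Tuple[int, float]]:
    """Rank concepts by the score drop observed when their tokens are zeroed."""
    base = toy_score(features, weights)
    drops = []
    for c in range(k):
        masked = [
            [0.0] * len(f) if lab == c else list(f)
            for f, lab in zip(features, labels)
        ]
        drops.append((c, base - toy_score(masked, weights)))
    return sorted(drops, key=lambda item: item[1], reverse=True)


# Toy token features: two well-separated groups in a 2-D feature space.
toy_features = [
    [5.0, 0.1], [5.1, 0.0], [4.9, 0.2], [5.2, 0.1],
    [0.1, 5.0], [0.0, 5.1], [0.2, 4.9], [0.1, 5.2],
]
toy_weights = [1.0, 0.0]  # the toy head only reads the first feature dimension

toy_labels = kmeans_labels(toy_features, k=2)
toy_ranking = rank_concepts(toy_features, toy_labels, toy_weights, k=2)
```

In this toy setup the cluster containing the first four tokens should rank highest, because the scoring head only reads the feature dimension in which those tokens are large. A real video transformer would need features gathered over space and time, and a ranking method that accounts for interactions between concepts, neither of which this sketch attempts.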

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)

Methods: Sparse Evolutionary Training · Attention Is All You Need · Absolute Position Encodings · Label Smoothing · Layer Normalization · Adam · Residual Connection · Dropout · Linear Layer · Multi-Head Attention