ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network
Nikolaos Gkalelis, Dimitrios Daskalakis, Vasileios Mezaris

TL;DR
ViGAT introduces a bottom-up, attention-based model utilizing object detection and graph attention networks to recognize and explain events in videos, achieving state-of-the-art results on multiple datasets.
Contribution
The paper presents a novel pure-attention bottom-up approach with factorized graph attention for event recognition and explanation in videos, emphasizing interpretability and effectiveness.
Findings
Achieves state-of-the-art performance on FCVID, Mini-Kinetics, and ActivityNet datasets.
Identifies the most salient objects and frames using weighted in-degrees (WiDs) derived from the adjacency matrices of the GAT blocks.
Demonstrates the importance of spatial and temporal dependencies in video event recognition.
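The weighted in-degree idea in the findings above can be illustrated with a small sketch. This is not the authors' code: the function names, the toy matrix, and the convention that entry `A[i][j]` is the weight of the edge from node `i` to node `j` (so a node's WiD is a column sum) are all assumptions made for illustration.

```python
def weighted_in_degrees(adjacency):
    """Return the column sums of a square weighted adjacency matrix.

    Assumption: adjacency[i][j] is the weight of the edge i -> j,
    so the weighted in-degree of node j is the sum over column j.
    """
    n = len(adjacency)
    if any(len(row) != n for row in adjacency):
        raise ValueError("adjacency must be square")
    return [sum(adjacency[i][j] for i in range(n)) for j in range(n)]


def top_k_salient(adjacency, k):
    """Indices of the k nodes with the largest weighted in-degree."""
    wids = weighted_in_degrees(adjacency)
    order = sorted(range(len(wids)), key=lambda j: wids[j], reverse=True)
    return order[:k]


# Toy 3-node graph (hypothetical values): every node attends mostly to node 1.
A = [
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.5, 0.2],
]
wids = weighted_in_degrees(A)
top = top_k_salient(A, 1)
```

In this toy graph node 1 receives the most attention from the other nodes, so it is ranked as the most salient. The same ranking could be applied to objects within a frame or to frames within a video, depending on which adjacency matrix is used.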
Abstract
This paper proposes ViGAT, a pure-attention bottom-up approach that uses an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features for the task of event recognition and explanation in video. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to effectively capture both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices at the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the decision of the network. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three…
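To make the spatial/temporal factorization concrete, here is a minimal sketch, assuming that object features are first related within each frame and the pooled frame features are then related across time. It is not the ViGAT implementation: plain scaled dot-product attention stands in for a GAT block, mean pooling stands in for whatever readout the paper uses, and all function names and toy values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_adjacency(nodes):
    """Row-stochastic matrix: row i holds node i's attention over all nodes."""
    scale = math.sqrt(len(nodes[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in nodes])
        for q in nodes
    ]


def attend_and_pool(nodes):
    """One attention pass over the nodes, then mean pooling to one vector."""
    adj = attention_adjacency(nodes)
    n, dim = len(nodes), len(nodes[0])
    updated = [
        [sum(adj[i][j] * nodes[j][d] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]
    pooled = [sum(row[d] for row in updated) / n for d in range(dim)]
    return pooled, adj


def factorized_forward(video):
    """video[t][o] is the feature vector of object o in frame t."""
    frame_features, spatial_adjs = [], []
    for objects in video:  # spatial step: objects within a single frame
        pooled, adj = attend_and_pool(objects)
        frame_features.append(pooled)
        spatial_adjs.append(adj)
    # temporal step: relate the pooled frame features across the video
    video_feature, temporal_adj = attend_and_pool(frame_features)
    return video_feature, spatial_adjs, temporal_adj


# Toy input: 2 frames, 3 objects per frame, 2-dimensional features.
video = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 0.0]],
]
video_feature, spatial_adjs, temporal_adj = factorized_forward(video)
```

The point of the factorization in this sketch is cost and interpretability: each attention matrix covers either the objects of one frame or the frames of one video, never every object in every frame at once, and each matrix can be inspected separately.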
Taxonomy
Topics: Advanced Graph Neural Networks · Advanced Neural Network Applications · Brain Tumor Detection and Classification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Label Smoothing
