ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network

Nikolaos Gkalelis; Dimitrios Daskalakis; Vasileios Mezaris

arXiv:2207.09927 · cs.CV · November 1, 2022

TL;DR

ViGAT is a bottom-up, pure-attention model that combines an object detector, a Vision Transformer (ViT) backbone, and factorized graph attention network (GAT) blocks to recognize and explain events in video. It reports state-of-the-art results on FCVID, Mini-Kinetics, and ActivityNet.

Contribution

The paper presents a novel pure-attention bottom-up approach with factorized graph attention for event recognition and explanation in videos, emphasizing interpretability and effectiveness.

Findings

01 Achieves state-of-the-art performance on the FCVID, Mini-Kinetics, and ActivityNet datasets.

02 Identifies the most salient objects and frames using the weighted in-degrees (WiDs) derived from the adjacency matrices of the GAT blocks.

03 Demonstrates the importance of modeling both spatial and temporal dependencies in video event recognition.
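
To make the weighted in-degree idea concrete, here is a minimal pure-Python sketch. It is not taken from the official repository. It assumes that `adjacency[i][j]` holds the weight of the edge from node i to node j, so that the weighted in-degree of node j is the sum of column j; the paper may use a different orientation or normalization. The function names and the toy matrix are hypothetical.

```python
def weighted_in_degrees(adjacency):
    """Return the column sums of a square attention/adjacency matrix.

    Assumption: adjacency[i][j] is the weight of the edge from node i
    to node j, so the weighted in-degree of node j is the sum of column j.
    """
    n = len(adjacency)
    if any(len(row) != n for row in adjacency):
        raise ValueError("adjacency must be square")
    return [sum(adjacency[i][j] for i in range(n)) for j in range(n)]


def top_k_salient(adjacency, k):
    """Return the indices of the k nodes with the largest weighted in-degree."""
    wids = weighted_in_degrees(adjacency)
    order = sorted(range(len(wids)), key=lambda j: wids[j], reverse=True)
    return order[:k]


# Toy 3-node graph with row-normalized attention weights (made-up numbers).
A = [
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.5, 0.2],
]
wids = weighted_in_degrees(A)   # approximately [0.6, 2.0, 0.4]
salient = top_k_salient(A, 2)   # node 1 first, then node 0
```

In this toy graph every node attends mostly to node 1, so node 1 receives the largest in-degree and would be reported as the most salient object or frame.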

Abstract

In this paper a pure-attention bottom-up approach, called ViGAT, that utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features for the task of event recognition and explanation in video, is proposed. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to capture effectively both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices at the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the decision of the network. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three…
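
The factorization along the spatial and temporal dimensions described in the abstract can be illustrated with a small pure-Python sketch. This is a simplification under stated assumptions, not the authors' GAT block: it uses plain scaled dot-product attention with no learned weights, mean-pools each graph to a single vector, and all function names and input numbers are hypothetical. It only shows the structure: one small graph over the objects of each frame, followed by one graph over the frames.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(nodes):
    """Row-normalized scaled dot-product attention between feature vectors."""
    scale = math.sqrt(len(nodes[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in nodes])
        for q in nodes
    ]


def attend_and_pool(nodes):
    """Mix nodes with their attention matrix, then mean-pool to one vector."""
    A = attention_matrix(nodes)
    n, dim = len(nodes), len(nodes[0])
    mixed = [
        [sum(A[i][j] * nodes[j][d] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]
    pooled = [sum(v[d] for v in mixed) / n for d in range(dim)]
    return pooled, A


def factorized_forward(video):
    """video[t][k] is the feature vector of object k in frame t."""
    frame_vecs, spatial_mats = [], []
    for objects in video:
        # Spatial step: one K x K attention graph per frame.
        vec, A = attend_and_pool(objects)
        frame_vecs.append(vec)
        spatial_mats.append(A)
    # Temporal step: one T x T attention graph over the frame vectors.
    video_vec, temporal_mat = attend_and_pool(frame_vecs)
    return video_vec, spatial_mats, temporal_mat


# Toy input: T = 2 frames, K = 3 objects per frame, 2-d features (made-up numbers).
video = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 0.2]],
]
video_vec, spatial_mats, temporal_mat = factorized_forward(video)
```

With T frames and K objects per frame, this layout builds T matrices of size K x K plus one of size T x T, rather than a single (T·K) x (T·K) matrix over all objects in all frames. The per-frame and per-video matrices are also the natural places to read off object-level and frame-level saliency.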

Code & Models

Repositories

bmezaris/vigat (PyTorch, official)

Taxonomy

Topics: Advanced Graph Neural Networks · Advanced Neural Network Applications · Brain Tumor Detection and Classification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Layer Normalization · Adam · Label Smoothing