Knowledge Fusion Transformers for Video Action Recognition

Ganesh Samarth; Sheetal Ojha; Nikhil Pareek

arXiv:2009.13782·cs.CV·October 1, 2020

TL;DR

This paper introduces Knowledge Fusion Transformers that enhance video action recognition by fusing action knowledge through self-attention mechanisms, achieving competitive results with minimal pretraining on standard datasets.

Contribution

The paper proposes a novel self-attention based feature enhancer within a transformer architecture for video action classification, reducing reliance on extensive pretraining.

Findings

01

Achieves performance close to the state of the art using only a single-stream network.

02

Outperforms single-stream networks that use little or no pretraining.

03

Effective fusion of different self-attention architectures improves feature representation.

Abstract

We introduce Knowledge Fusion Transformers for video action classification. We present a self-attention-based feature enhancer that fuses action knowledge in the 3D-Inception-based spatio-temporal context of the video clip to be classified. We show how single-stream networks with little or no pretraining can reach performance close to the current state of the art. Additionally, we show how different self-attention architectures, used at different levels of the network, can be blended in to enhance feature representation. Our architecture is trained and evaluated on the UCF-101 and Charades datasets, where it is competitive with the state of the art. It also outperforms single-stream networks with little or no pretraining by a large margin.
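
To make the idea of a self-attention-based feature enhancer concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the abstract does not specify the layers, so this example assumes the simplest possible form, scaled dot-product self-attention with a residual connection and no learned projections, applied to a handful of made-up feature vectors standing in for per-segment clip features. The names `self_attention_enhance` and `clip_features` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_enhance(features):
    """Enhance each feature vector with one residual self-attention step.

    `features` is a list of equal-length float lists, one per clip segment.
    Assumption: queries, keys and values are the raw features themselves
    (no learned projection matrices), purely for illustration.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    enhanced = []
    for query in features:
        # Similarity of this segment to every segment, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in features
        ]
        weights = softmax(scores)
        # Attention-weighted average of all segment features.
        context = [
            sum(w * value[d] for w, value in zip(weights, features))
            for d in range(dim)
        ]
        # Residual connection: original feature plus attended context.
        enhanced.append([x + c for x, c in zip(query, context)])
    return enhanced


# Toy stand-in for spatio-temporal features of three clip segments.
clip_features = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [1.0, 1.0, 0.0]]
enhanced = self_attention_enhance(clip_features)
```

Each output vector keeps the shape of its input and adds a context term that is a weighted average over all segments, which is one plausible reading of "enhancing" features with attention. A real model would use learned query, key and value projections and operate on 3D convolutional feature maps.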

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications