Knowledge Fusion Transformers for Video Action Recognition
Ganesh Samarth, Sheetal Ojha, Nikhil Pareek

TL;DR
This paper introduces Knowledge Fusion Transformers, which enhance video action recognition by fusing action knowledge through self-attention, achieving competitive results on UCF-101 and Charades with little or no pretraining.
Contribution
The paper proposes a novel self-attention based feature enhancer within a transformer architecture for video action classification, reducing reliance on extensive pretraining.
Findings
Achieves performance close to the state of the art using only a single-stream network.
Outperforms, by a large margin, single-stream networks that use little or no pretraining.
Blending different self-attention architectures at different levels of the network improves feature representation.
Abstract
We introduce Knowledge Fusion Transformers for video action classification. We present a self-attention based feature enhancer to fuse action knowledge in the 3D inception-based spatio-temporal context of the video clip to be classified. We show how single-stream networks with little or no pretraining can pave the way for performance close to the current state of the art. Additionally, we show how different self-attention architectures, used at different levels of the network, can be blended in to enhance feature representation. Our architecture is trained and evaluated on the UCF-101 and Charades datasets, where it is competitive with the state of the art. It also outperforms, by a large margin, single-stream networks that use little or no pretraining.
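To make the idea of a self-attention based feature enhancer concrete, here is a minimal sketch in NumPy. It is not the authors' architecture: it shows generic single-head scaled dot-product self-attention with a residual connection, applied to a sequence of per-position clip features. The function name `self_attention_enhance`, the shapes `T` and `D`, and the random projection matrices are all hypothetical, standing in for features that a 3D backbone would produce and for weights that would be learned.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_enhance(features, w_q, w_k, w_v):
    """Enhance clip features with single-head self-attention (illustrative only).

    features: (T, D) array, one D-dim feature vector per spatio-temporal position.
    w_q, w_k, w_v: (D, D) projection matrices (hypothetical; learned in practice).
    Returns the enhanced (T, D) features and the (T, T) attention weights.
    """
    q = features @ w_q
    k = features @ w_k
    v = features @ w_v
    # Scaled dot-product scores between every pair of positions.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    attended = weights @ v
    # Residual connection: the original features are kept and enriched.
    return features + attended, weights


# Assumed toy sizes: 8 positions, 16 channels.
rng = np.random.default_rng(0)
T, D = 8, 16
features = rng.standard_normal((T, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))

enhanced, weights = self_attention_enhance(features, w_q, w_k, w_v)
print(enhanced.shape, weights.shape)
```

Each row of `weights` sums to 1, so every position's enhanced feature is its original feature plus a convex combination of the value vectors from all positions in the clip.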
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
