Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification
Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, Shilei Wen

TL;DR
This paper introduces a purely attention-based framework for integrating local features in video classification. It reports competitive results without CNN- or RNN-based temporal modeling, including 79.4% top-1 accuracy on the Kinetics dataset.
Contribution
The paper proposes a novel attention clusters framework with a shifting operation for local feature integration in video classification, challenging the necessity of long-term temporal modeling.
Findings
Achieves 79.4% top-1 accuracy on the Kinetics dataset
Outperforms many existing methods in video classification
Wins the ActivityNet Kinetics Challenge 2017
Abstract
Recently, substantial research effort has focused on how to apply CNNs or RNNs to better extract temporal patterns from videos, so as to improve the accuracy of video classification. In this paper, however, we show that temporal information, especially longer-term patterns, may not be necessary to achieve competitive results on common video classification datasets. We investigate the potential of a purely attention based local feature integration. Accounting for the characteristics of such features in video classification, we propose a local feature integration framework based on attention clusters, and introduce a shifting operation to capture more diverse signals. We carefully analyze and compare the effect of different attention mechanisms, cluster sizes, and the use of the shifting operation, and also investigate the combination of attention clusters for multimodal integration. We…
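The framework described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The dot-product scoring vector `w`, the scale-shift-normalize form of the shifting operation, and all function and parameter names are assumptions made for illustration, since this summary does not give the exact formulation. Each attention unit takes a softmax-weighted sum over the local features, applies a learnable scale `alpha` and shift `beta`, and normalizes the result; a cluster concatenates the outputs of several units.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_unit(X, w, alpha, beta, n_units):
    """One attention unit with an (assumed) shifting operation.

    X: (L, D) array of L local features, e.g. per-frame descriptors.
    w: (D,) hypothetical scoring vector that produces attention logits.
    alpha, beta: scalar scale and shift (learnable in a real model).
    n_units: cluster size, used to rescale the normalized output.
    """
    a = softmax(X @ w)            # (L,) attention weights over local features
    v = alpha * (a @ X) + beta    # weighted sum, then scale and shift
    # L2-normalize and divide by sqrt(n_units) so the whole cluster has unit norm.
    return v / (np.sqrt(n_units) * np.linalg.norm(v) + 1e-12)

def attention_cluster(X, W, alphas, betas):
    """Concatenate the outputs of N attention units into one global feature."""
    n = W.shape[0]
    units = [attention_unit(X, W[k], alphas[k], betas[k], n) for k in range(n)]
    return np.concatenate(units)  # shape (N * D,)

# Usage: 10 local features of dimension 8, cluster of 4 attention units.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 8))
W = rng.normal(size=(4, 8))
g = attention_cluster(X, W, alphas=np.ones(4), betas=np.zeros(4))
print(g.shape)  # (32,)
```

In this sketch the output does not change if the rows of `X` are reordered, because the weighted sum ignores the order of the local features. This matches the abstract's claim that longer-term temporal patterns may not be necessary for competitive results.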
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Video Analysis and Summarization
