Deep Multi-Kernel Convolutional LSTM Networks and an Attention-Based Mechanism for Videos
Sebastian Agethen, Winston H. Hsu

TL;DR
This paper introduces a multi-kernel convolutional LSTM network with an attention mechanism for video action recognition and reports improved accuracy on the UCF-101 and Sports-1M datasets.
Contribution
It extends convolutional LSTM networks to accommodate multiple convolutional kernels and layers (a Network-in-LSTM approach) and adds an attention mechanism designed specifically for this multi-kernel extension.
Findings
Improved accuracy on the UCF-101 dataset
Enhanced performance on the Sports-1M dataset
Qualitative analysis of model characteristics
Abstract
Action recognition greatly benefits motion understanding in video analysis. Recurrent networks such as long short-term memory (LSTM) networks are a popular choice for motion-aware sequence learning tasks. Recently, a convolutional extension of LSTM was proposed, in which input-to-hidden and hidden-to-hidden transitions are modeled through convolution with a single kernel. This implies an unavoidable trade-off between effectiveness and efficiency. Herein, we propose a new enhancement to convolutional LSTM networks that supports accommodation of multiple convolutional kernels and layers. This resembles a Network-in-LSTM approach, which improves upon the aforementioned concern. In addition, we propose an attention-based mechanism that is specifically designed for our multi-kernel extension. We evaluated our proposed extensions in a supervised classification setting on the UCF-101 and…
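The multi-kernel idea in the abstract can be illustrated with a minimal NumPy sketch of one convolutional LSTM step. This is not the authors' implementation. It assumes single-channel feature maps, kernel sizes of 1, 3 and 5, and a plain sum over the per-kernel responses in place of whatever fusion the paper actually uses. The names `conv2d_same`, `multi_kernel_conv` and `conv_lstm_step` are hypothetical, and the attention mechanism is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, k):
    """Zero-padded 2-D cross-correlation of a single-channel map with an odd-sized kernel."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))  # constant (zero) padding keeps the output size
    out = np.zeros_like(x, dtype=float)
    for di in range(kh):
        for dj in range(kw):
            out += k[di, dj] * xp[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

def multi_kernel_conv(x, kernels):
    """Assumption: responses of differently sized kernels are fused by summation."""
    return sum(conv2d_same(x, k) for k in kernels)

def conv_lstm_step(x, h, c, Wx, Wh, b):
    """One ConvLSTM step; Wx[gate] and Wh[gate] are lists of kernels, one per kernel size."""
    pre = {gate: multi_kernel_conv(x, Wx[gate]) + multi_kernel_conv(h, Wh[gate]) + b[gate]
           for gate in "ifog"}
    i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    g = np.tanh(pre["g"])          # candidate cell update
    c_new = f * c + i * g          # forget old state, write new candidate
    h_new = o * np.tanh(c_new)     # gated hidden state
    return h_new, c_new

# Usage: run four random 8x8 "frames" through the cell.
rng = np.random.default_rng(0)
sizes = (1, 3, 5)                  # hypothetical kernel sizes
Wx = {gate: [rng.normal(scale=0.1, size=(s, s)) for s in sizes] for gate in "ifog"}
Wh = {gate: [rng.normal(scale=0.1, size=(s, s)) for s in sizes] for gate in "ifog"}
b = {gate: 0.0 for gate in "ifog"}

h = np.zeros((8, 8))
c = np.zeros((8, 8))
for frame in rng.normal(size=(4, 8, 8)):
    h, c = conv_lstm_step(frame, h, c, Wx, Wh, b)
```

With a single kernel per gate this reduces to an ordinary convolutional LSTM step, so the only change the sketch makes is letting each transition combine several receptive-field sizes.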
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods
Methods: Sigmoid Activation · Tanh Activation · Convolution · Long Short-Term Memory
