Long-term Temporal Convolutions for Action Recognition
Gül Varol, Ivan Laptev, Cordelia Schmid

TL;DR
This paper introduces long-term temporal convolutions in neural networks to better capture full-duration actions in videos, significantly improving recognition accuracy on benchmark datasets.
Contribution
It proposes LTC-CNN models with extended temporal receptive fields and highlights the importance of high-quality optical flow for action recognition.
Findings
Achieved state-of-the-art accuracy on UCF101 (92.7%)
Achieved state-of-the-art accuracy on HMDB51 (67.2%)
Long-term convolutions improve action recognition performance
Abstract
Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human…
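To make the abstract's point about temporal extent concrete, the following minimal sketch compares how much of an action a single fixed-length clip can see. All numbers are illustrative assumptions rather than values from the text above: the 25 fps frame rate, the 4-second action, the 250-frame video, and the clip lengths of 16, 60, and 100 frames. The helper names are hypothetical.

```python
# Illustrative arithmetic only: frame rate, action length, and clip lengths
# below are assumptions chosen for the example, not values from the summary.

def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Duration in seconds covered by a clip of num_frames at the given frame rate."""
    return num_frames / fps


def fraction_covered(action_seconds: float, num_frames: int, fps: float) -> float:
    """Fraction of an action of length action_seconds seen by one clip (capped at 1)."""
    return min(1.0, clip_duration_seconds(num_frames, fps) / action_seconds)


def num_clips(video_frames: int, clip_frames: int, stride: int) -> int:
    """Number of sliding-window clips of clip_frames taken every stride frames.

    Assumes a video shorter than one clip is still processed as a single clip.
    """
    if video_frames < clip_frames:
        return 1
    return (video_frames - clip_frames) // stride + 1


if __name__ == "__main__":
    fps = 25.0             # assumed frame rate
    action_seconds = 4.0   # assumed duration of one action
    video_frames = 250     # assumed video length (10 s at 25 fps)
    for clip_frames in (16, 60, 100):
        seconds = clip_duration_seconds(clip_frames, fps)
        covered = fraction_covered(action_seconds, clip_frames, fps)
        clips = num_clips(video_frames, clip_frames, stride=clip_frames)
        print(f"{clip_frames:3d} frames: {seconds:.2f} s per clip, "
              f"{covered:.0%} of the action, {clips} non-overlapping clips")
```

Under these assumptions, a short clip covers well under a second of video, so a network trained on such clips sees only a fragment of an action lasting several seconds, whereas a longer clip can span most or all of it.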
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
