TAM: Temporal Adaptive Module for Video Recognition
Zhaoyang Liu, Limin Wang, Wayne Wu, Chen Qian, Tong Lu

TL;DR
This paper introduces TAM, a temporal adaptive module that generates video-specific temporal kernels from a video's own feature map to model complex temporal dynamics, improving recognition performance at a very small extra computational cost.
Contribution
The paper proposes TAM, a modular block that decouples the dynamic temporal kernel into a location-sensitive importance map and a location-invariant aggregation weight, and that can be inserted into 2D CNNs for video recognition.
Findings
TAM outperforms existing temporal modeling methods on Kinetics-400 and Something-Something datasets.
TAM achieves state-of-the-art performance with minimal extra computational cost.
TAM can be integrated into 2D CNNs to create effective video recognition architectures.
Abstract
Video data exhibits complex temporal dynamics due to factors such as camera motion, speed variation, and different activities. To effectively capture these diverse motion patterns, this paper presents a new temporal adaptive module (TAM) that generates video-specific temporal kernels based on the video's own feature map. TAM uses a two-level adaptive modeling scheme that decouples the dynamic kernel into a location-sensitive importance map and a location-invariant aggregation weight. The importance map is learned in a local temporal window to capture short-term information, while the aggregation weight is generated from a global view with a focus on long-term structure. TAM is a modular block and can be integrated into 2D CNNs to yield a powerful video architecture (TANet) with a very small extra computational cost. The extensive experiments on Kinetics-400 and…
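The two-level scheme described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: it assumes features have already been spatially pooled to a C x T grid, uses a per-channel 3-frame window for the local branch, a single linear map plus softmax for the global branch, and random weights in place of learned ones. The names `local_importance`, `global_kernel`, `tam`, `w_local`, and `w_global` are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def local_importance(x, w_local):
    """Location-sensitive importance map from a local 3-frame window.

    x: C x T pooled features. w_local: C x 3 window weights (assumed shape).
    Returns a C x T map with values in (0, 1).
    """
    T = len(x[0])
    imp = []
    for xc, wc in zip(x, w_local):
        row = []
        for t in range(T):
            acc = 0.0
            for k in range(3):
                idx = t + k - 1  # window centred on frame t, zero padding
                if 0 <= idx < T:
                    acc += wc[k] * xc[idx]
            row.append(sigmoid(acc))
        imp.append(row)
    return imp


def global_kernel(x, w_global):
    """Location-invariant aggregation weights from a global temporal view.

    w_global: K x T linear map (assumed). Each channel gets one K-tap
    kernel computed from all T frames and normalised with softmax.
    """
    return [
        softmax([sum(w * v for w, v in zip(wk, xc)) for wk in w_global])
        for xc in x
    ]


def tam(x, w_local, w_global):
    """Gate features with the importance map, then aggregate over time
    with the video-specific kernel (channel-wise temporal convolution)."""
    imp = local_importance(x, w_local)
    kernels = global_kernel(x, w_global)
    T, K = len(x[0]), len(w_global)
    pad = K // 2
    y = []
    for xc, ic, kc in zip(x, imp, kernels):
        z = [a * b for a, b in zip(ic, xc)]  # importance-weighted features
        row = []
        for t in range(T):
            acc = 0.0
            for k in range(K):
                idx = t + k - pad
                if 0 <= idx < T:
                    acc += kc[k] * z[idx]
            row.append(acc)
        y.append(row)
    return y, imp, kernels


# Toy input: 4 channels, 8 frames, 3-tap adaptive kernel (all illustrative).
rng = random.Random(0)
C, T, K = 4, 8, 3
x = [[rng.gauss(0, 1) for _ in range(T)] for _ in range(C)]
w_local = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(C)]
w_global = [[rng.gauss(0, 1) for _ in range(T)] for _ in range(K)]
y, imp, kernels = tam(x, w_local, w_global)
```

The importance map varies with the frame index t, while each channel's kernel is shared across all frames of that video; that asymmetry is what the abstract means by location sensitive versus location invariant.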
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Analysis and Summarization
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Temporal Adaptive Module
