TL;DR
This paper introduces a plug-in Temporal Correlation Module (TCM) that enhances video action recognition by capturing fine-grained visual tempo from low-level features, improving performance across multiple benchmarks.
Contribution
The work presents a Temporal Correlation Module (TCM), built from a Multi-scale Temporal Dynamics Module (MTDM) and a Temporal Attention Module (TAM), that extracts temporal dynamics at a single layer and is reported to outperform previous multi-rate sampling methods.
Findings
Reported accuracy improvements on benchmarks such as Kinetics-400 and Something-Something V2.
Effective extraction of both fast and slow temporal dynamics.
Plug-in design allows easy integration into existing models.
Abstract
Action visual tempo characterizes the dynamics and the temporal scale of an action, which helps distinguish human actions that share high similarities in visual dynamics and appearance. Previous methods capture the visual tempo either by sampling raw videos at multiple rates, which requires a costly multi-layer network to handle each rate, or by hierarchically sampling backbone features, which relies heavily on high-level features that miss fine-grained temporal dynamics. In this work, we propose a Temporal Correlation Module (TCM), which can be easily embedded into current action recognition backbones in a plug-and-play manner, to extract action visual tempo from low-level backbone features at a single layer. Specifically, our TCM contains two main components: a Multi-scale Temporal Dynamics Module (MTDM) and a Temporal Attention Module (TAM). MTDM applies a…
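To make the notion of visual tempo concrete, here is a minimal sketch that compares per-frame feature vectors at a short and a long temporal stride. It is not the paper's MTDM or TAM, since the abstract above is cut off before describing them. The function names (`cosine`, `temporal_correlation`, `multi_scale_tempo`), the choice of cosine similarity, and the strides `(1, 4)` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def temporal_correlation(frames, stride):
    """Similarity of each frame's features to the frame `stride` steps later."""
    return [cosine(frames[t], frames[t + stride])
            for t in range(len(frames) - stride)]


def multi_scale_tempo(frames, strides=(1, 4)):
    """Mean frame-to-frame correlation for each stride (hypothetical helper).

    A low mean at a small stride suggests fast change (fast tempo);
    a high mean even at a large stride suggests slow change (slow tempo).
    """
    result = {}
    for s in strides:
        corr = temporal_correlation(frames, s)
        result[s] = sum(corr) / len(corr)
    return result


def rotating_features(step_angle, num_frames=8):
    """Toy 2-D 'features': a unit vector rotating by `step_angle` per frame."""
    return [[math.cos(step_angle * t), math.sin(step_angle * t)]
            for t in range(num_frames)]


slow_action = rotating_features(0.1)  # features drift slowly between frames
fast_action = rotating_features(0.8)  # features change quickly between frames

slow_tempo = multi_scale_tempo(slow_action)
fast_tempo = multi_scale_tempo(fast_action)
```

In this toy setup the slow action keeps a high correlation at both strides, while the fast action loses correlation already at stride 1, which is the kind of contrast between fast and slow temporal dynamics that the summary refers to.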
Taxonomy
Methods: Temporal Adaptive Module · Low-level backbone
