Dynamic Temporal Filtering in Video Models
Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei

TL;DR
This paper introduces Dynamic Temporal Filter (DTF), a novel method for long-range temporal modeling in videos that dynamically learns frequency domain filters for each spatial location, improving over fixed kernel approaches.
Contribution
The paper proposes DTF, a frequency domain-based temporal modeling technique that dynamically adapts filters per spatial location, enabling larger receptive fields and better long-range temporal understanding.
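As a rough illustration of what per-location filtering in the frequency domain can look like, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the temporal-mean context, and the linear projection weights `w_real` and `w_imag` are all assumptions made for this example, since the summary does not say how DTF generates its filters.

```python
import numpy as np


def predict_filters(x, w_real, w_imag):
    """Predict one complex frequency filter per spatial location.

    x:       (T, H, W, C) real-valued video features.
    w_real,
    w_imag:  (C, F) hypothetical projection weights, F = T // 2 + 1.
    Returns: (H, W, F) complex filters.

    Assumption: the filter is a linear function of the temporal mean of
    the features at each location. The paper may generate filters in a
    different way.
    """
    context = x.mean(axis=0)  # (H, W, C) summary of each location
    return context @ w_real + 1j * (context @ w_imag)


def apply_temporal_filters(x, filters):
    """Filter each location's time series in the frequency domain.

    x:       (T, H, W, C) real-valued video features.
    filters: (H, W, F) complex filters, F = T // 2 + 1.
    Returns: (T, H, W, C) filtered features.
    """
    num_frames = x.shape[0]
    spectrum = np.fft.rfft(x, axis=0)  # (F, H, W, C)
    # Move the frequency axis first and add a channel axis so that every
    # channel at a location shares that location's filter.
    per_location = np.moveaxis(filters, -1, 0)[..., None]  # (F, H, W, 1)
    return np.fft.irfft(spectrum * per_location, n=num_frames, axis=0)


# Example usage with random features and random (untrained) weights.
rng = np.random.default_rng(0)
T, H, W, C = 8, 4, 4, 16
features = rng.standard_normal((T, H, W, C))
w_real = rng.standard_normal((C, T // 2 + 1)) * 0.1
w_imag = rng.standard_normal((C, T // 2 + 1)) * 0.1

filters = predict_filters(features, w_real, w_imag)
filtered = apply_temporal_filters(features, filters)
```

Because each filter has one coefficient per temporal frequency, every output frame can depend on every input frame, while the filter itself varies from one spatial location to the next.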
Findings
DTF outperforms existing methods on multiple datasets.
DTF-Transformer achieves 83.5% accuracy on Kinetics-400.
The approach effectively models long-range temporal dependencies.
Abstract
Video temporal dynamics is conventionally modeled with 3D spatial-temporal kernel or its factorized version comprised of 2D spatial kernel and 1D temporal kernel. The modeling power, nevertheless, is limited by the fixed window size and static weights of a kernel along the temporal dimension. The pre-determined kernel size severely limits the temporal receptive fields and the fixed weights treat each spatial location across frames equally, resulting in sub-optimal solution for long-range temporal modeling in natural scenes. In this paper, we present a new recipe of temporal feature learning, namely Dynamic Temporal Filter (DTF), that novelly performs spatial-aware temporal modeling in frequency domain with large temporal receptive field. Specifically, DTF dynamically learns a specialized frequency filter for every spatial location to model its long-range temporal dynamics. Meanwhile,…
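The link between frequency-domain filtering and a large temporal receptive field follows from the convolution theorem: multiplying two spectra pointwise is equivalent to a circular convolution whose kernel spans the whole sequence. The short check below illustrates that general fact with NumPy. The function names are made up for this example, and the circular boundary behavior is a property of the plain DFT, not something the abstract states about DTF.

```python
import numpy as np


def filter_via_fft(signal, kernel):
    """Pointwise multiplication of spectra, then inverse transform."""
    product = np.fft.fft(signal) * np.fft.fft(kernel)
    return np.fft.ifft(product).real


def circular_convolve(signal, kernel):
    """Direct circular convolution with a kernel as long as the signal."""
    length = len(signal)
    return np.array([
        sum(kernel[s] * signal[(t - s) % length] for s in range(length))
        for t in range(length)
    ])


rng = np.random.default_rng(0)
num_frames = 8
signal = rng.standard_normal(num_frames)
kernel = rng.standard_normal(num_frames)  # covers all 8 time steps

# Both routes give the same result, so a frequency filter acts like a
# temporal kernel whose window is the full clip length.
assert np.allclose(filter_via_fft(signal, kernel),
                   circular_convolve(signal, kernel))
```

By contrast, a 1D temporal kernel of size 3 only mixes each frame with its immediate neighbors, which is the fixed-window limitation the abstract describes.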
Taxonomy
Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Video Surveillance and Tracking Methods
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Layer Normalization · Softmax · Adam · Absolute Position Encodings
