Dynamic Temporal Filtering in Video Models

Fuchen Long; Zhaofan Qiu; Yingwei Pan; Ting Yao; Chong-Wah Ngo; Tao Mei

arXiv:2211.08252·cs.CV·November 16, 2022

TL;DR

This paper introduces the Dynamic Temporal Filter (DTF), a method for long-range temporal modeling in video that dynamically learns a frequency-domain filter for each spatial location, in contrast to temporal kernels with a fixed window size and static weights.

Contribution

The paper proposes DTF, a frequency-domain temporal modeling technique that adapts its filter to each spatial location. Because filtering is done in the frequency domain, the temporal receptive field is no longer bounded by a pre-determined kernel size, which supports long-range temporal modeling.
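
Why filtering in the frequency domain widens the receptive field follows from the convolution theorem. As a sketch, consider a single spatial location with a temporal feature sequence x_0, ..., x_{T-1} and a frequency filter G_k. These symbols are illustrative assumptions and are not taken from the paper.

```latex
X_k = \sum_{t=0}^{T-1} x_t \, e^{-2\pi i k t / T}, \qquad
Y_k = G_k \, X_k, \qquad
y_t = \frac{1}{T} \sum_{k=0}^{T-1} Y_k \, e^{2\pi i k t / T}
    = \sum_{s=0}^{T-1} g_{(t-s) \bmod T} \, x_s ,
\quad \text{where } g_t = \frac{1}{T} \sum_{k=0}^{T-1} G_k \, e^{2\pi i k t / T}.
```

Each output y_t is therefore a weighted sum over all T frames, so a single element-wise multiplication in the frequency domain acts like a circular temporal convolution whose kernel is as long as the clip.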

Findings

01  DTF outperforms existing methods on multiple datasets.

02  DTF-Transformer achieves 83.5% accuracy on Kinetics-400.

03  The approach effectively models long-range temporal dependencies.

Abstract

Video temporal dynamics are conventionally modeled with a 3D spatio-temporal kernel or its factorized version, composed of a 2D spatial kernel and a 1D temporal kernel. The modeling power, nevertheless, is limited by the fixed window size and static weights of a kernel along the temporal dimension. The pre-determined kernel size severely limits the temporal receptive field, and the fixed weights treat each spatial location across frames equally, resulting in a sub-optimal solution for long-range temporal modeling in natural scenes. In this paper, we present a new recipe for temporal feature learning, namely Dynamic Temporal Filter (DTF), that performs spatial-aware temporal modeling in the frequency domain with a large temporal receptive field. Specifically, DTF dynamically learns a specialized frequency filter for every spatial location to model its long-range temporal dynamics. Meanwhile,…
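
The per-location frequency filtering described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the tensor layout `(T, H, W, C)`, the function names, and in particular `predict_filters` (a hypothetical linear projection of time-averaged features that stands in for whatever filter generator the paper uses) are all invented for this example.

```python
import numpy as np


def temporal_frequency_filter(features, freq_filters):
    """Filter every spatial location along time in the frequency domain.

    features:     real array of shape (T, H, W, C)
    freq_filters: complex array broadcastable to (T // 2 + 1, H, W, C)
    """
    num_frames = features.shape[0]
    # Real FFT along the temporal axis: (T, H, W, C) -> (T // 2 + 1, H, W, C).
    spectrum = np.fft.rfft(features, axis=0)
    # Element-wise product; each (h, w) location uses its own filter.
    filtered = spectrum * freq_filters
    # Back to the time domain with the original number of frames.
    return np.fft.irfft(filtered, n=num_frames, axis=0)


def predict_filters(features, w_real, w_imag):
    """Hypothetical filter generator (an assumption, not from the paper).

    Projects the time-averaged feature at each location to one complex
    coefficient per frequency bin, so the filter depends on the input.
    w_real, w_imag: real arrays of shape (C, T // 2 + 1)
    """
    context = features.mean(axis=0)                    # (H, W, C)
    real = np.einsum("hwc,cf->fhw", context, w_real)   # (F, H, W)
    imag = np.einsum("hwc,cf->fhw", context, w_imag)   # (F, H, W)
    # Trailing axis of size 1 shares the filter across channels.
    return (real + 1j * imag)[..., None]               # (F, H, W, 1)


# Toy sizes chosen only for the demonstration.
rng = np.random.default_rng(0)
T, H, W, C = 8, 4, 4, 16
num_bins = T // 2 + 1

features = rng.standard_normal((T, H, W, C))
w_real = 0.1 * rng.standard_normal((C, num_bins))
w_imag = 0.1 * rng.standard_normal((C, num_bins))

filters = predict_filters(features, w_real, w_imag)
output = temporal_frequency_filter(features, filters)
print(output.shape)

# Sanity check: an all-ones filter leaves the features unchanged.
identity = np.ones((num_bins, 1, 1, 1), dtype=complex)
restored = temporal_frequency_filter(features, identity)
```

Because the product is taken over every frequency bin, each output frame depends on all `T` input frames at that location, which is the large temporal receptive field the abstract refers to. Sharing one filter across channels is a simplification made here to keep the sketch short.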

Code & Models

Repositories

fuchenustc/dtf
PyTorch · Official

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Layer Normalization · Softmax · Adam · Absolute Position Encodings