Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition

Syed Talal Wasim; Muhammad Uzair Khattak; Muzammal Naseer; Salman Khan; Mubarak Shah; Fahad Shahbaz Khan

arXiv:2307.06947 · cs.CV · October 30, 2023 · 2 cites

TL;DR

Video-FocalNet introduces an efficient spatio-temporal focal modulation architecture that combines local and global context modeling for video recognition, outperforming transformer models with lower computational costs.

Contribution

This work proposes Video-FocalNet, a novel architecture that reverses the order of self-attention's interaction and aggregation steps and implements both with convolution and element-wise multiplication for efficient long-range context modeling in videos.

Findings

01 Outperforms state-of-the-art transformer models on five large-scale datasets.

02 Achieves comparable or better accuracy with lower computational cost.

03 Demonstrates the effectiveness of focal modulation for spatio-temporal video understanding.
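
As a rough illustration of why a convolutional aggregation step can cost less than self-attention, the sketch below counts multiplications for the token-mixing step only. It assumes the convolutions are depthwise and ignores the linear projections that both designs need. The token counts, channel width, and kernel sizes are hypothetical values chosen for illustration, not figures from the paper.

```python
def attention_mults(num_tokens, channels):
    """Multiplications for token mixing in self-attention:
    the Q K^T score matrix plus the score-weighted sum of values."""
    return 2 * num_tokens ** 2 * channels


def depthwise_conv_mults(num_tokens, channels, kernel_sizes):
    """Multiplications for a stack of depthwise convolutions,
    one pass over all tokens per kernel size (assumed design)."""
    return sum(num_tokens * channels * k for k in kernel_sizes)


# Hypothetical settings: 196 tokens for one frame, 1568 for eight frames.
for n in (196, 1568):
    attn = attention_mults(n, channels=768)
    conv = depthwise_conv_mults(n, channels=768, kernel_sizes=(3, 5, 7))
    print(f"tokens={n}: attention={attn:,} conv={conv:,}")
```

The attention count grows quadratically with the number of tokens, while the convolution count grows linearly, which is the general pattern behind the efficiency claim.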

Abstract

Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their…
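
The aggregate-then-interact order described in the abstract can be sketched in NumPy as follows. This is a minimal, simplified illustration along a single token axis, not the authors' implementation: the function names, the parameter dictionary, the gating scheme, the GELU nonlinearity, and the kernel sizes are all assumptions made for the example. Context is first aggregated with depthwise convolutions at several scales, and only then does each token's query interact with the aggregated context through element-wise multiplication.

```python
import numpy as np


def depthwise_conv1d(x, kernel):
    """Depthwise 1D convolution with zero 'same' padding.
    x: (N, C) tokens; kernel: (K, C), one filter per channel, K odd."""
    n, _ = x.shape
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for j in range(k):
        out += xp[j:j + n] * kernel[j]
    return out


def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def init_params(rng, channels, kernel_sizes):
    """Random weights for the sketch (hypothetical parameter layout)."""
    scale = channels ** -0.5
    return {
        "w_q": rng.standard_normal((channels, channels)) * scale,
        "w_ctx": rng.standard_normal((channels, channels)) * scale,
        "w_gate": rng.standard_normal((channels, len(kernel_sizes) + 1)) * scale,
        "w_h": rng.standard_normal((channels, channels)) * scale,
        "kernels": [rng.standard_normal((k, channels)) / k for k in kernel_sizes],
    }


def focal_modulation_1d(x, params):
    """x: (N, C) tokens along one axis. Returns modulated tokens, shape (N, C)."""
    q = x @ params["w_q"]          # per-token query
    ctx = x @ params["w_ctx"]      # features to aggregate
    gates = x @ params["w_gate"]   # per-token weight for each context level

    # Aggregation: stacked depthwise convolutions widen the receptive field.
    modulator = np.zeros_like(ctx)
    for level, kernel in enumerate(params["kernels"]):
        ctx = gelu(depthwise_conv1d(ctx, kernel))
        modulator += ctx * gates[:, level:level + 1]

    # Global context: average over all tokens, gated by the last gate.
    global_ctx = gelu(ctx.mean(axis=0, keepdims=True))
    modulator += global_ctx * gates[:, -1:]

    # Interaction: element-wise multiplication of query and modulator.
    return q * (modulator @ params["w_h"])


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))
params = init_params(rng, channels=8, kernel_sizes=(3, 5))
out = focal_modulation_1d(tokens, params)
print(out.shape)  # (16, 8)
```

A spatio-temporal variant would presumably apply this kind of aggregation over both the spatial and the temporal axes of a video; the single axis here is kept only to make the order of operations easy to follow.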

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Label Smoothing · Linear Layer · Residual Connection · Adam · Dense Connections · Dropout · Convolution