Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
Syed Talal Wasim, Muhammad Uzair Khattak, Muzammal Naseer, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan

TL;DR
Video-FocalNet introduces an efficient spatio-temporal focal modulation architecture that combines local and global context modeling for video recognition, outperforming transformer models with lower computational costs.
Contribution
This work proposes Video-FocalNet, a novel architecture that reverses the interaction and aggregation steps of self-attention and implements both with convolution and element-wise multiplication for efficient long-range context modeling in videos.
Findings
Outperforms state-of-the-art transformer models on five large-scale datasets.
Achieves comparable or better accuracy with lower computational cost.
Demonstrates the effectiveness of focal modulation for spatio-temporal video understanding.
Abstract
Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention that can model global context at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their…
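The aggregate-then-interact ordering described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a single channel and a purely temporal token sequence, uses fixed averaging kernels in place of learned depthwise convolutions, and omits the learned projections, nonlinearities, and spatial branch. The names `depthwise_conv1d`, `focal_modulation_1d`, `kernel_sizes`, `gates`, and `w_q` are hypothetical.

```python
def depthwise_conv1d(seq, kernel):
    """Same-length 1-D convolution with zero padding (odd kernel size assumed)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            k = i + j - half
            if 0 <= k < len(seq):
                acc += w * seq[k]
        out.append(acc)
    return out


def focal_modulation_1d(x, kernel_sizes=(3, 5), gates=None, w_q=1.0):
    """Toy single-channel focal modulation over a token sequence x.

    Assumptions (not from the paper): averaging kernels, fixed scalar
    gates shared by all tokens, and a scalar query weight w_q.
    """
    n_levels = len(kernel_sizes) + 1  # local levels plus one global level
    if gates is None:
        gates = [1.0 / n_levels] * n_levels

    # Aggregation first: build context at growing receptive fields by
    # stacking convolutions, then add a global average as the last level.
    contexts = []
    z = list(x)
    for k in kernel_sizes:
        z = depthwise_conv1d(z, [1.0 / k] * k)
        contexts.append(z)
    global_ctx = sum(z) / len(z)
    contexts.append([global_ctx] * len(x))

    # Gated sum of all context levels gives one modulator per token.
    modulator = [
        sum(g * ctx[i] for g, ctx in zip(gates, contexts))
        for i in range(len(x))
    ]

    # Interaction second: element-wise product of query and modulator.
    return [w_q * x[i] * modulator[i] for i in range(len(x))]


if __name__ == "__main__":
    print(focal_modulation_1d([0.0, 0.0, 3.0, 0.0, 0.0], kernel_sizes=(3,)))
```

In this sketch the per-token work depends on the kernel sizes rather than on pairwise comparisons between all tokens, which is the intuition behind the efficiency claim.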
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Label Smoothing · Linear Layer · Residual Connection · Adam · Dense Connections · Dropout · Convolution
