DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition
Hayat Ullah, Muhammad Ali Shafique, Abbas Khan, and Arslan Munir

TL;DR
DVFL-Net is a lightweight, knowledge-distilled video recognition model that balances high accuracy with computational efficiency, making it suitable for real-time on-device action recognition.
Contribution
The paper introduces DVFL-Net, a novel lightweight video recognition network that employs knowledge distillation and focal modulation for efficient spatiotemporal modeling.
Findings
Achieves competitive accuracy with reduced GFLOPs and memory usage.
Effectively transfers knowledge from a large teacher model to a compact student.
Demonstrates strong performance on multiple benchmark datasets.
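The teacher-to-student transfer noted in the findings can be illustrated with a minimal sketch of a generic knowledge-distillation loss. This summary does not state the exact objective DVFL-Net uses, so the code assumes the common temperature-scaled formulation (a KL term on softened teacher and student outputs, blended with cross-entropy on the ground-truth label). The names `softmax`, `distillation_loss`, `temperature`, and `alpha` are hypothetical and chosen for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Generic distillation objective (an assumption, not the paper's exact loss).

    alpha weights the soft-target KL term; (1 - alpha) weights the
    hard-label cross-entropy term.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # KL(teacher || student) on the softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)

    # Cross-entropy of the student against the ground-truth class index.
    ce = -math.log(softmax(student_logits)[label])

    # The T^2 factor keeps the soft-target gradient scale comparable
    # across different temperatures.
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce


if __name__ == "__main__":
    teacher = [5.0, 1.0, 0.5]   # made-up logits for a 3-class toy example
    student = [2.0, 1.5, 0.5]
    print(distillation_loss(student, teacher, label=0))
```

A higher `temperature` flattens both distributions so the student also learns from the teacher's relative ranking of incorrect classes, while `alpha` controls how much the student relies on the teacher versus the labels.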
Abstract
The landscape of video recognition has evolved significantly, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures for improved accuracy. While 3D CNNs have been effective at capturing spatiotemporal dynamics, recent Transformer models leverage self-attention to model long-range spatial and temporal dependencies. Despite achieving state-of-the-art performance on major benchmarks, Transformers remain computationally expensive, particularly with dense video data. To address this, we propose a lightweight Video Focal Modulation Network, DVFL-Net, which distills spatiotemporal knowledge from a large pre-trained teacher into a compact nano student model, enabling efficient on-device deployment. DVFL-Net utilizes knowledge distillation and spatial-temporal feature modulation to significantly reduce computation while preserving high recognition…
Taxonomy
Topics: Gait Recognition and Analysis · Hand Gesture Recognition Systems · Video Surveillance and Tracking Methods
Methods: Dropout · Knowledge Distillation · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Balanced Selection · Transformer
