DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition

Hayat Ullah, Muhammad Ali Shafique, Abbas Khan, and Arslan Munir

arXiv:2507.12426·cs.CV·July 21, 2025

TL;DR

DVFL-Net is a lightweight, knowledge-distilled video recognition model that balances high accuracy with computational efficiency, suitable for real-time on-device action recognition.

Contribution

The paper introduces DVFL-Net, a novel lightweight video recognition network that employs knowledge distillation and focal modulation for efficient spatiotemporal modeling.

Findings

1. Achieves competitive accuracy with reduced GFLOPs and memory usage.
2. Effectively transfers knowledge from a large teacher model to a compact student.
3. Demonstrates strong performance on multiple benchmark datasets.

Abstract

The landscape of video recognition has evolved significantly, shifting from traditional Convolutional Neural Networks (CNNs) to Transformer-based architectures for improved accuracy. While 3D CNNs have been effective at capturing spatiotemporal dynamics, recent Transformer models leverage self-attention to model long-range spatial and temporal dependencies. Despite achieving state-of-the-art performance on major benchmarks, Transformers remain computationally expensive, particularly with dense video data. To address this, we propose a lightweight Video Focal Modulation Network, DVFL-Net, which distills spatiotemporal knowledge from a large pre-trained teacher into a compact nano student model, enabling efficient on-device deployment. DVFL-Net utilizes knowledge distillation and spatial-temporal feature modulation to significantly reduce computation while preserving high recognition…
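The abstract describes distilling knowledge from a large teacher into a compact student. As a rough illustration of that general idea, here is a minimal sketch of a standard soft-target distillation loss in plain Python. It is not taken from the paper or its repository: the function names, the `temperature` and `alpha` parameters, and the specific combination of a KL term with a cross-entropy term are assumptions chosen to show the common formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Hypothetical loss: alpha * soft-target KL + (1 - alpha) * hard-label CE.

    `temperature` and `alpha` are illustrative defaults, not values
    reported by the paper.
    """
    # Soft targets: the student matches the teacher's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = forward_kl(p_teacher, p_student) * temperature ** 2

    # Hard targets: ordinary cross-entropy against the ground-truth class.
    hard = -math.log(softmax(student_logits)[label] + 1e-12)

    return alpha * soft + (1.0 - alpha) * hard
```

For example, with teacher logits `[4.0, 1.0, 0.5]` and true class 0, a student producing `[3.5, 1.2, 0.4]` gets a lower loss than one producing `[0.5, 1.0, 4.0]`, because it agrees with both the teacher's distribution and the label.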

Code & Models

Repositories

hayatkhan8660-maker/DVFL-Net (PyTorch, official)

Taxonomy

Topics: Gait Recognition and Analysis · Hand Gesture Recognition Systems · Video Surveillance and Tracking Methods

Methods: Dropout · Knowledge Distillation · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Balanced Selection · Transformer