TDS-CLIP: Temporal Difference Side Network for Efficient Video Action Recognition
Bin Wang, Wentong Li, Wenqian Wang, Mingliang Gao, Runmin Cong, Wei Zhang

TL;DR
This paper introduces TDS-CLIP, a memory-efficient side network that enhances temporal modeling and motion feature learning in video action recognition by leveraging adapters, without extensive backpropagation, achieving competitive results.
Contribution
The paper proposes a novel TDS-CLIP framework with specialized adapters to improve temporal and motion feature learning in video recognition, reducing training costs.
Findings
Achieves competitive accuracy on benchmark datasets.
Effectively captures local temporal differences in motion features.
Enhances motion information learning with minimal backpropagation.
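To make the idea of "local temporal differences" concrete, here is a minimal sketch that subtracts each frame's feature vector from the next one. It is a generic illustration only, not the paper's adapter design: the function name `temporal_difference`, the zero-padding of the first frame, and the toy feature values are all assumptions made for this example.

```python
from typing import List


def temporal_difference(frame_features: List[List[float]]) -> List[List[float]]:
    """Return, for each frame, its feature vector minus the previous frame's.

    The first frame has no predecessor, so its row is filled with zeros
    (an assumption made here to keep the output the same length as the input).
    """
    if not frame_features:
        return []
    dim = len(frame_features[0])
    diffs = [[0.0] * dim]
    # Pair each frame with its successor and subtract element-wise.
    for prev, curr in zip(frame_features, frame_features[1:]):
        diffs.append([c - p for p, c in zip(prev, curr)])
    return diffs


# Hypothetical toy input: 4 frames, each with a 3-dimensional feature vector.
features = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 2.0],
    [1.0, 0.0, 2.0],  # identical to the previous frame: no motion
    [3.0, 1.0, 2.0],
]
motion = temporal_difference(features)
```

Rows of `motion` that are all zeros correspond to frames whose features did not change, which is why difference features are often read as a rough motion signal.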
Abstract
Recently, large-scale pre-trained vision-language models (e.g., CLIP) have garnered significant attention thanks to their powerful representational capabilities. This has inspired researchers to transfer knowledge from these large pre-trained models to task-specific models, e.g., Video Action Recognition (VAR) models, in particular by leveraging side networks to improve the efficiency of parameter-efficient fine-tuning (PEFT). However, current transfer approaches in VAR tend to pass the frozen knowledge of large pre-trained models directly to action recognition networks at minimal cost, rather than exploiting the temporal modeling capabilities of the action recognition models themselves. Therefore, in this paper, we propose a novel memory-efficient Temporal Difference Side Network (TDS-CLIP) to balance knowledge transfer and temporal modeling, avoiding…
Taxonomy
Topics: Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis · Image and Signal Denoising Methods
Methods: Softmax · Attention Is All You Need · Adapter
