Efficient Transfer Learning for Video-language Foundation Models
Haoxing Chen, Zizheng Huang, Yan Hong, Yanshuo Wang, Zhongcai Lyu, Zhuoer Xu, Jun Lan, Zhangxuan Gu

TL;DR
This paper introduces a parameter-efficient multi-modal spatio-temporal adapter for video-language models that improves transfer learning performance across various tasks while using minimal additional parameters.
Contribution
It proposes a novel, lightweight adapter and a spatio-temporal consistency constraint to enhance video-language model transferability and generalization with minimal parameter increase.
Findings
Achieves state-of-the-art results in zero-shot, few-shot, and fully-supervised tasks.
Uses only 2-7% of the original model's trainable parameters.
Enhances model generalization and reduces overfitting.
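To give a feel for how an adapter's parameter share is computed, here is a minimal sketch that counts the weights of a generic bottleneck adapter (a down-projection followed by an up-projection) and compares them with a backbone size. It is not the MSTA architecture, which this page does not specify. The hidden width, bottleneck width, layer count, and backbone size are all hypothetical values chosen for illustration and are not taken from the paper.

```python
def bottleneck_adapter_params(hidden_dim: int, bottleneck_dim: int) -> int:
    """Count weights and biases of a down-projection plus up-projection pair."""
    down = hidden_dim * bottleneck_dim + bottleneck_dim  # W_down and b_down
    up = bottleneck_dim * hidden_dim + hidden_dim        # W_up and b_up
    return down + up


def parameter_fraction(adapter_params: int, backbone_params: int) -> float:
    """Adapter parameters as a fraction of the (frozen) backbone parameters."""
    return adapter_params / backbone_params


# Hypothetical configuration, NOT from the paper.
HIDDEN_DIM = 768               # assumed transformer width
BOTTLENECK_DIM = 64            # assumed adapter bottleneck width
NUM_LAYERS = 12                # assumed number of adapted layers
BACKBONE_PARAMS = 150_000_000  # assumed backbone size

per_layer = bottleneck_adapter_params(HIDDEN_DIM, BOTTLENECK_DIM)
total = per_layer * NUM_LAYERS
fraction = parameter_fraction(total, BACKBONE_PARAMS)

print(f"{total:,} adapter parameters ({fraction:.2%} of the backbone)")
```

The share scales roughly linearly with the bottleneck width and with the number of adapted layers, so those two choices largely determine where a design of this kind lands.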
Abstract
Pre-trained vision-language models provide a robust foundation for efficient transfer learning across various downstream tasks. In the field of video action recognition, mainstream approaches often introduce additional modules to capture temporal information. Although the additional modules increase the capacity of the model, enabling it to better capture video-specific inductive biases, existing methods typically introduce a substantial number of new parameters and are prone to catastrophic forgetting of previously acquired generalizable knowledge. In this paper, we propose a parameter-efficient Multi-modal Spatio-Temporal Adapter (MSTA) to enhance the alignment between textual and visual representations, achieving a balance between generalizable knowledge and task-specific adaptation. Furthermore, to mitigate over-fitting and enhance generalizability, we introduce a spatio-temporal…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Adapter
