Efficient Transfer Learning for Video-language Foundation Models

Haoxing Chen; Zizheng Huang; Yan Hong; Yanshuo Wang; Zhongcai Lyu; Zhuoer Xu; Jun Lan; Zhangxuan Gu

arXiv:2411.11223·cs.CV·March 19, 2025

TL;DR

This paper introduces a parameter-efficient multi-modal spatio-temporal adapter for video-language models that improves transfer learning performance across various tasks while using minimal additional parameters.

Contribution

The paper proposes a lightweight Multi-modal Spatio-Temporal Adapter (MSTA) and a spatio-temporal consistency constraint to improve the transferability and generalization of video-language models with few additional parameters.

Findings

01. Achieves state-of-the-art results in zero-shot, few-shot, and fully-supervised tasks.

02. Uses only 2-7% of the original model's trainable parameters.

03. Enhances model generalization and reduces overfitting.

Abstract

Pre-trained vision-language models provide a robust foundation for efficient transfer learning across various downstream tasks. In the field of video action recognition, mainstream approaches often introduce additional modules to capture temporal information. Although the additional modules increase the capacity of the model, enabling it to better capture video-specific inductive biases, existing methods typically introduce a substantial number of new parameters and are prone to catastrophic forgetting of previously acquired generalizable knowledge. In this paper, we propose a parameter-efficient Multi-modal Spatio-Temporal Adapter (MSTA) to enhance the alignment between textual and visual representations, achieving a balance between generalizable knowledge and task-specific adaptation. Furthermore, to mitigate over-fitting and enhance generalizability, we introduce a spatio-temporal…
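
To make the idea of parameter-efficient adaptation concrete, the sketch below shows a generic residual bottleneck adapter and a rough parameter count against one standard transformer layer. It is not the MSTA design from the paper, whose details are not given on this page. The function names, the ReLU nonlinearity, the omission of biases in the forward pass, and the sizes (`d_model=768`, `bottleneck=64`) are all assumptions chosen for illustration, and the resulting ratio is not the paper's 2-7% figure.

```python
# Illustrative sketch only: a generic bottleneck adapter, NOT the paper's MSTA.
# All names and sizes here are hypothetical.

def adapter_forward(x, w_down, w_up):
    """Apply a residual bottleneck adapter to one feature vector.

    x      : list of d floats (a frozen backbone feature)
    w_down : list of b rows, each of length d (down-projection)
    w_up   : list of d rows, each of length b (up-projection)
    Biases are omitted to keep the sketch short.
    """
    # Down-project to the bottleneck and apply ReLU.
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w_down]
    # Up-project back to the model dimension.
    delta = [sum(h * w for h, w in zip(hidden, row)) for row in w_up]
    # Residual connection: the frozen feature plus a learned correction.
    return [xi + di for xi, di in zip(x, delta)]


def adapter_params(d_model, bottleneck):
    """Weights and biases of the down- and up-projections."""
    return d_model * bottleneck + bottleneck + bottleneck * d_model + d_model


def transformer_layer_params(d_model, mlp_ratio=4):
    """Rough count for one standard transformer layer (assumed layout)."""
    attn = 4 * (d_model * d_model + d_model)   # Q, K, V and output projections
    hidden = mlp_ratio * d_model
    mlp = d_model * hidden + hidden + hidden * d_model + d_model
    norms = 2 * 2 * d_model                    # two LayerNorms, scale and shift
    return attn + mlp + norms


if __name__ == "__main__":
    ratio = adapter_params(768, 64) / transformer_layer_params(768)
    print(f"adapter / layer parameters: {ratio:.2%}")
```

With the up-projection initialized to zeros, the adapter returns its input unchanged, so training starts from the frozen model's behavior. That is one common way such modules limit drift from pre-trained knowledge.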

Code & Models

Repositories

chenhaoxing/etl4video (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Adapter