RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
Meng Cao, Haoran Tang, Jinfa Huang, Peng Jin, Can Zhang, Ruyang Liu, Long Chen, Xiaodan Liang, Li Yuan, Ge Li

TL;DR
This paper introduces RAP, a parameter-efficient method for text-video retrieval that fine-tunes large pre-trained models using sparse and correlated adapters, improving efficiency while maintaining high performance.
Contribution
The paper proposes a novel sparse-and-correlated adapter for efficient fine-tuning of vision-language models in text-video retrieval tasks.
Findings
RAP achieves comparable or superior performance to fully fine-tuned models.
The method reduces computational costs significantly.
Extensive experiments validate RAP's effectiveness across multiple datasets.
Abstract
Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning based on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation costs. To this end, we propose to conduct efficient text-video Retrieval with a sparse-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics: temporal sparsity and correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from the frozen CLIP backbone, which accentuates salient frames within the video features while alleviating temporal redundancy. Besides, we…
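The abstract describes adapting a frozen CLIP backbone through a few parameterized layers built around a low-rank module. As a rough illustration of that general idea only (not the paper's actual RAP module, whose details are not given on this page), a minimal residual low-rank adapter in NumPy might look like the sketch below. The names `low_rank_adapter`, `w_down`, and `w_up`, the rank of 8, and the 512-dimensional random features standing in for CLIP outputs are all assumptions made for the example.

```python
import numpy as np


def low_rank_adapter(frames, w_down, w_up, scale=1.0):
    """Generic residual bottleneck adapter (hypothetical, not the RAP module).

    frames: (T, D) per-frame features from a frozen backbone.
    w_down: (D, r) trainable down-projection, with r much smaller than D.
    w_up:   (r, D) trainable up-projection.
    """
    hidden = np.maximum(frames @ w_down, 0.0)  # ReLU in the rank-r bottleneck
    return frames + scale * (hidden @ w_up)    # residual refinement of the features


rng = np.random.default_rng(0)
T, D, r = 12, 512, 8                           # frames, feature dim, rank (assumed values)
frames = rng.standard_normal((T, D))           # stand-in for frozen CLIP frame features
w_down = rng.standard_normal((D, r)) * 0.02    # small random init
w_up = np.zeros((r, D))                        # zero init: adapter starts as the identity

out = low_rank_adapter(frames, w_down, w_up)
trainable = w_down.size + w_up.size            # only these weights would be updated
```

With these assumed sizes the adapter holds 2 × 512 × 8 = 8,192 trainable values, compared with 262,144 for a single dense 512 × 512 layer, which is the kind of saving that motivates training small adapters instead of the full backbone.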
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Adapter · ALIGN · Contrastive Language-Image Pre-training
