RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter

Meng Cao; Haoran Tang; Jinfa Huang; Peng Jin; Can Zhang; Ruyang Liu; Long Chen; Xiaodan Liang; Li Yuan; Ge Li

arXiv:2405.19465·cs.CV·May 31, 2024

Open Access

TL;DR

This paper introduces RAP, a parameter-efficient method for text-video retrieval that fine-tunes large pre-trained models using sparse and correlated adapters, improving efficiency while maintaining high performance.

Contribution

The paper proposes a novel sparse-and-correlated adapter for efficient fine-tuning of vision-language models in text-video retrieval tasks.

Findings

01. RAP achieves comparable or superior performance to fully fine-tuned models.

02. The method reduces computational costs significantly.

03. Extensive experiments validate RAP's effectiveness across multiple datasets.

Abstract

Text-Video Retrieval (TVR) aims to align relevant video content with natural language queries. To date, most state-of-the-art TVR methods learn image-to-video transfer learning based on large-scale pre-trained vision-language models (e.g., CLIP). However, fully fine-tuning these pre-trained models for TVR incurs prohibitively expensive computation costs. To this end, we propose to conduct efficient text-video Retrieval with a sparse-and-correlated AdaPter (RAP), i.e., fine-tuning the pre-trained model with a few parameterized layers. To accommodate the text-video scenario, we equip our RAP with two indispensable characteristics: temporal sparsity and correlation. Specifically, we propose a low-rank modulation module to refine the per-image features from the frozen CLIP backbone, which accentuates salient frames within the video features while alleviating temporal redundancy. Besides, we…
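
The general idea of refining frozen per-frame features with "a few parameterized layers" can be illustrated with a generic low-rank bottleneck adapter. The sketch below is not the RAP module itself (the abstract is truncated and does not give its equations); it is a minimal, assumed example in NumPy. The function names (`init_adapter`, `apply_adapter`, `num_params`), the feature width of 512, the rank of 8, and the 12 sampled frames are all hypothetical choices made for illustration.

```python
import numpy as np


def init_adapter(dim, rank, seed=0):
    """Create a low-rank adapter: a (dim x rank) down-projection and a
    (rank x dim) up-projection. Zero-initialising the up-projection makes
    the adapter an identity mapping before any training (a common adapter
    convention, assumed here, not taken from the paper)."""
    rng = np.random.default_rng(seed)
    w_down = rng.normal(0.0, 0.02, size=(dim, rank))
    w_up = np.zeros((rank, dim))
    return w_down, w_up


def apply_adapter(frame_feats, w_down, w_up):
    """Refine per-frame features of shape (num_frames, dim) with a
    residual low-rank update; the backbone features are left untouched."""
    hidden = np.maximum(frame_feats @ w_down, 0.0)  # ReLU bottleneck, (num_frames, rank)
    return frame_feats + hidden @ w_up              # residual connection


def num_params(w_down, w_up):
    """Number of trainable values in the adapter."""
    return w_down.size + w_up.size


# Stand-in for frozen backbone outputs: 12 frames, 512-dim features (hypothetical sizes).
frame_feats = np.random.default_rng(1).normal(size=(12, 512))
w_down, w_up = init_adapter(dim=512, rank=8)
refined = apply_adapter(frame_feats, w_down, w_up)
```

With these assumed sizes the adapter holds 2 x 512 x 8 = 8,192 trainable values, against 262,144 for a single dense 512 x 512 layer, which is the kind of saving that motivates adapter-style fine-tuning over updating the full backbone.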

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Adapter · ALIGN · Contrastive Language-Image Pre-training