DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

Leqi Shen; Guoqiang Gong; Tianxiang Hao; Tao He; Yifeng Zhang; Pengzhang Liu; Sicheng Zhao; Jungong Han; Guiguang Ding

arXiv:2506.08887 · cs.CV · June 11, 2025

TL;DR

DiscoVLA improves parameter-efficient video-text retrieval by simultaneously reducing the vision, language, and alignment discrepancies that arise when adapting CLIP from images to videos, outperforming previous methods on MSRVTT by 1.5% in R@1.

Contribution

The paper proposes a method that reduces all three key discrepancies in video-text retrieval (vision, language, and alignment) through three components: Image-Video Features Fusion, pseudo image caption generation, and alignment distillation.

Findings

01 Outperforms previous methods on MSRVTT by 1.5% in R@1.

02 Integrates image-level and video-level features through Image-Video Features Fusion.

03 Improves alignment through distillation.
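
The R@1 figure above is a Recall@K metric. As a minimal sketch of how such a number is usually computed for text-to-video retrieval, the snippet below assumes a square similarity matrix in which the correct video for query i sits at column i. The function name `recall_at_k` and the toy scores are illustrative assumptions and are not taken from the paper or its repository.

```python
from typing import List


def recall_at_k(sim: List[List[float]], k: int = 1) -> float:
    """Text-to-video Recall@K, in percent.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth match for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the correct video = number of videos scored strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy similarity matrix (made-up numbers): 3 text queries x 3 videos.
SIM = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: video 2 outranks the correct video 1
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]

print(f"R@1 = {recall_at_k(SIM, k=1):.1f}")
print(f"R@2 = {recall_at_k(SIM, k=2):.1f}")
```

In this toy case two of the three queries rank their correct video first, so R@1 is two thirds, while every correct video lies within the top two.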

Abstract

The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate…
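
To make the image-level versus video-level distinction concrete, here is a rough sketch under stated assumptions. It treats the image-level feature as the mean of independently encoded frame embeddings, takes a separate video-level embedding as given, and fuses the two with a hypothetical weight `alpha`. This is not the paper's Image-Video Features Fusion, whose details the truncated abstract does not give; every name and number below is an assumption for illustration.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_pool(frames: List[Vector]) -> Vector:
    """Average per-frame embeddings into one image-level feature."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def fuse(image_level: Vector, video_level: Vector, alpha: float = 0.5) -> Vector:
    """Hypothetical fusion: convex combination of the two unit-norm features.

    alpha is an assumed mixing weight, not a parameter from the paper.
    """
    a = l2_normalize(image_level)
    b = l2_normalize(video_level)
    return l2_normalize([alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)])


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, the matching score used by CLIP-style retrieval."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Made-up 2-D embeddings standing in for encoder outputs.
frame_embeddings = [[1.0, 0.0], [0.8, 0.2]]  # each frame encoded independently
video_embedding = [0.0, 1.0]                 # from an assumed temporal encoder
text_embedding = [1.0, 1.0]

fused = fuse(mean_pool(frame_embeddings), video_embedding, alpha=0.5)
print("text-video similarity:", round(cosine(fused, text_embedding), 4))
```

Setting `alpha` to 1.0 falls back to purely image-level matching, which is the regime the abstract attributes to CLIP.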

Code & Models

Repositories

lunarshen/dsicovla
PyTorch · Official

Datasets

LeqiShen/DiscoVLA
dataset · 11 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling

Methods: Contrastive Language-Image Pre-training · Focus