Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval
Xudong Lin, Simran Tiwari, Shiyuan Huang, Manling Li, Mike Zheng Shou, Heng Ji, Shih-Fu Chang

TL;DR
This paper explores how to quickly adapt contrastive models for multi-channel video-language retrieval, finding that using discrete text tokens with a pretrained contrastive text model offers the best performance and can outperform state-of-the-art methods.
Contribution
It systematically analyzes model design choices for multi-channel video-language retrieval and identifies an effective combination of discrete text tokens with a pretrained contrastive text model.
Findings
Discrete text tokens with contrastive text models outperform other methods.
The proposed approach surpasses the state of the art on the iVQA and How2QA datasets.
Representing videos as text tokens effectively captures visual information.
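The combination named in the findings above (a video represented as discrete text tokens, scored against candidate texts by a contrastive text encoder) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: `toy_text_encoder` is a hashed bag-of-words stand-in for a pretrained contrastive text model such as SimCSE, and the video tokens, question, and candidate answers are invented.

```python
import hashlib
import math

DIM = 64  # size of the toy embedding space (arbitrary choice)


def toy_text_encoder(text):
    """Stand-in for a pretrained contrastive text encoder (assumption).

    Hashes each word into one of DIM buckets and L2-normalizes the counts,
    so a dot product between two outputs is a cosine similarity.
    """
    vec = [0.0] * DIM
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b):
    # Inputs are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def rank_answers(video_tokens, question, candidates):
    """Fuse video tokens and question as plain text, then rank candidates."""
    query = " ".join(video_tokens) + " " + question
    query_vec = toy_text_encoder(query)
    scored = [(cosine(query_vec, toy_text_encoder(c)), c) for c in candidates]
    return sorted(scored, reverse=True)  # highest similarity first


# Hypothetical inputs: tokens that a visual tagger might emit for a video.
video_tokens = ["knife", "onion", "cutting board"]
question = "what is being chopped"
candidates = ["onion", "guitar", "bicycle"]

ranking = rank_answers(video_tokens, question, candidates)
for score, answer in ranking:
    print(f"{score:.3f}  {answer}")
```

Because the video is turned into words, fusing it with the question reduces to string concatenation, and retrieval reduces to a nearest-neighbor search in the text encoder's embedding space. Swapping the toy encoder for a real contrastive sentence encoder would leave the rest of the sketch unchanged.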
Abstract
Multi-channel video-language retrieval requires models to understand information from different channels (e.g. video+question, video+speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text, e.g., CLIP; text contrastive models have recently been studied extensively for their strong ability to produce discriminative sentence embeddings, e.g., SimCSE. However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate the options of representing videos using continuous feature vectors or…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
Methods: SimCSE
