Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval

Xudong Lin; Simran Tiwari; Shiyuan Huang; Manling Li; Mike Zheng Shou; Heng Ji; Shih-Fu Chang

arXiv:2206.02082·cs.CV·April 12, 2023·1 cites

TL;DR

This paper explores how to quickly adapt contrastive models for multi-channel video-language retrieval, finding that using discrete text tokens with a pretrained contrastive text model offers the best performance and can outperform state-of-the-art methods.

Contribution

It systematically analyzes model design choices for multi-channel video-language retrieval and identifies an effective combination of discrete text tokens with a pretrained contrastive text model.

Findings

1. Discrete text tokens combined with a pretrained contrastive text model outperform the other design options examined.

2. The proposed approach surpasses the state of the art on the iVQA and How2QA datasets.

3. Representing videos as text tokens effectively captures visual information.

Abstract

Multi-channel video-language retrieval requires models to understand information from different channels (e.g. video + question, video + speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models such as CLIP have been shown to be highly effective at aligning entities in images/videos and text, and text contrastive models such as SimCSE have recently been studied extensively for their strong ability to produce discriminative sentence embeddings. However, there is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on categorization of recent methods, we investigate the options of representing videos using continuous feature vectors or…
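As a rough illustration of the combination the summary highlights (a video represented as discrete text tokens, fused with the question as plain text, then matched against candidate answers by a text encoder), here is a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: `VOCAB`, `video_to_tokens`, `encode_text`, and `retrieve` are hypothetical names, the 3-dimensional frame features are toy values, and the bag-of-words encoder is only a stand-in for a pretrained contrastive text model such as SimCSE.

```python
import math
from collections import Counter

# Hypothetical toy vocabulary: each word has an embedding in the same
# space as the frame features (in practice a multimodal model such as
# CLIP would provide this alignment; here the values are made up).
VOCAB = {
    "onion": (1.0, 0.0, 0.0),
    "knife": (0.0, 1.0, 0.0),
    "guitar": (0.0, 0.0, 1.0),
}


def video_to_tokens(frame_features, vocab=VOCAB):
    """Map each frame feature to its highest-scoring vocabulary word."""
    tokens = []
    for feat in frame_features:
        best = max(vocab, key=lambda w: sum(f * v for f, v in zip(feat, vocab[w])))
        if best not in tokens:  # keep each word once, in first-seen order
            tokens.append(best)
    return tokens


def encode_text(text):
    """Toy stand-in for a contrastive text encoder: sparse word counts."""
    return Counter(text.lower().replace("?", " ").split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(frame_features, question, candidates):
    """Fuse video tokens and question as one text query, then rank answers."""
    query = " ".join(video_to_tokens(frame_features)) + " " + question
    query_vec = encode_text(query)
    return max(candidates, key=lambda c: cosine(query_vec, encode_text(c)))


if __name__ == "__main__":
    frames = [(0.9, 0.1, 0.0), (0.2, 0.8, 0.1)]
    print(retrieve(frames, "What is being chopped?", ["an onion", "a guitar", "a car"]))
```

Because the toy encoder only rewards exact word overlap, it cannot capture the semantic matching that a real contrastive text model would provide; the sketch is meant to show the data flow, not the retrieval quality.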

Code & Models

Repositories

xudonglinthu/upgradable-multimodal-intelligence (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling

Methods: SimCSE