Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

Xing Cheng; Hezheng Lin; Xiangyu Wu; Fan Yang; Dong Shen

arXiv:2109.04290 · cs.CV · November 23, 2021 · 66 cites

TL;DR

This paper introduces a multi-stream corpus alignment network with a dual softmax loss to improve video-text retrieval, effectively addressing heterogeneity issues and achieving state-of-the-art results on multiple benchmarks.

Contribution

The paper proposes CAMoE, which uses Mixture-of-Experts to build multi-perspective video representations, and a novel Dual Softmax Loss (DSL) to improve similarity matching in video-text retrieval.

Findings

01

Achieves SOTA performance on MSR-VTT, MSVD, and LSMDC benchmarks.

02

Surpasses previous methods by around 4.6% R@1 on MSR-VTT.

03

Demonstrates the effectiveness of CAMoE and DSL individually and combined.

Abstract

Using the large-scale pre-trained model CLIP for the video-text retrieval (VTR) task has become a new trend, and CLIP-based models outperform previous VTR methods. However, due to the heterogeneity of structures and contents between video and text, previous CLIP-based models are prone to overfitting in the training phase, resulting in relatively poor retrieval performance. In this paper, we propose a multi-stream Corpus Alignment network with single-gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address these two kinds of heterogeneity. CAMoE employs Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, scene, etc., and then aligns them with the corresponding part of the text. In this stage, we conduct extensive exploration of the feature extraction module and the feature alignment module. DSL is proposed to avoid the one-way optimum-match which…
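As a rough illustration of the dual softmax idea named in the abstract, the following NumPy sketch rescales a batch similarity matrix by a softmax prior taken along the opposite axis, then applies the usual contrastive cross-entropy in each retrieval direction. It is one possible reading, not the authors' implementation: the function name `dual_softmax_loss`, the parameters `prior_temp` and `logit_scale` with their default values, and the equal averaging of the two directions are all assumptions made for illustration.

```python
import numpy as np


def _softmax(x, axis):
    # Numerically stable softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def dual_softmax_loss(sim, prior_temp=100.0, logit_scale=100.0):
    """Sketch of a dual-softmax-style contrastive loss.

    sim[i, j] is the similarity between video i and text j in a batch,
    with matched pairs on the diagonal. prior_temp and logit_scale are
    hypothetical hyperparameters, not values taken from the paper.
    """
    sim = np.asarray(sim, dtype=np.float64)

    # Video-to-text: the prior is a softmax across videos (axis 0), so a
    # pair keeps a high score only if the text also prefers that video.
    v2t = logit_scale * sim * _softmax(prior_temp * sim, axis=0)
    loss_v2t = -np.mean(np.diag(_log_softmax(v2t, axis=1)))

    # Text-to-video: the mirror image, with the prior across texts (axis 1).
    t2v = logit_scale * sim * _softmax(prior_temp * sim, axis=1)
    loss_t2v = -np.mean(np.diag(_log_softmax(t2v, axis=0)))

    # Averaging the two directions is an assumption of this sketch.
    return 0.5 * (loss_v2t + loss_t2v)
```

With a similarity matrix whose largest entries sit on the diagonal, the loss is close to zero; when the largest entries are off the diagonal, the loss grows, which is the behaviour a retrieval objective of this kind should show.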

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Contrastive Language-Image Pre-training · Linear Warmup With Linear Decay · Weight Decay · Adam · Dropout · Attention Dropout