Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss
Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, Dong Shen

TL;DR
This paper introduces a multi-stream corpus alignment network with a dual softmax loss to improve video-text retrieval, effectively addressing heterogeneity issues and achieving state-of-the-art results on multiple benchmarks.
Contribution
It proposes CAMoE, a multi-stream Corpus Alignment network that uses Mixture-of-Experts to build multi-perspective video representations, and a novel Dual Softmax Loss (DSL) to improve similarity matching in video-text retrieval.
Findings
Achieves SOTA performance on MSR-VTT, MSVD, and LSMDC benchmarks.
Surpasses previous methods by around 4.6% R@1 on MSR-VTT.
Demonstrates the effectiveness of CAMoE and DSL individually and combined.
Abstract
Using the large-scale pre-trained model CLIP for the video-text retrieval (VTR) task has become a new trend, and it outperforms previous VTR methods. However, due to the heterogeneity of structures and contents between video and text, previous CLIP-based models are prone to overfitting in the training phase, resulting in relatively poor retrieval performance. In this paper, we propose a multi-stream Corpus Alignment network with single gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address these two kinds of heterogeneity. CAMoE employs Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, scene, etc., and then aligns them with the corresponding parts of the text. In this stage, we conduct extensive explorations of the feature extraction module and the feature alignment module. DSL is proposed to avoid the one-way optimum-match which…
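As a rough illustration of the dual softmax idea named in the abstract, the sketch below rescales a video-text similarity matrix by a prior computed with a softmax along the opposite axis, then applies an ordinary softmax cross-entropy in both retrieval directions. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `temperature` parameter, the toy similarity values, and the exact way the prior is multiplied in are all hypothetical, since the abstract above is truncated before it describes the loss in detail.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def dual_softmax_scores(sim, temperature=1.0):
    """Rescale sim[i][j] by a prior taken from a softmax over column j.

    Assumption: rows index queries and columns index candidates, so the
    prior says how strongly candidate j prefers query i among all queries.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    column_priors = [
        softmax([sim[i][j] * temperature for i in range(n_rows)])
        for j in range(n_cols)
    ]
    return [
        [sim[i][j] * column_priors[j][i] for j in range(n_cols)]
        for i in range(n_rows)
    ]


def retrieval_loss(scores):
    """Mean cross-entropy, treating the diagonal as the matching pairs."""
    losses = [-math.log(softmax(row)[i]) for i, row in enumerate(scores)]
    return sum(losses) / len(losses)


def dual_softmax_loss(sim, temperature=1.0):
    """Sum of the two directional losses on prior-rescaled scores."""
    forward = retrieval_loss(dual_softmax_scores(sim, temperature))
    backward = retrieval_loss(dual_softmax_scores(transpose(sim), temperature))
    return forward + backward


# Toy 2x2 similarity matrix (hypothetical values): rows are videos,
# columns are texts, and the diagonal holds the ground-truth pairs.
sim = [[0.9, 0.8],
       [0.1, 0.7]]
rescaled = dual_softmax_scores(sim, temperature=10.0)
loss = dual_softmax_loss(sim, temperature=10.0)
print(rescaled)
print(loss)
```

In this toy matrix, video 0 scores almost as well against text 1 as against its own text 0. Because text 1 is also a reasonable match for video 1, the column-wise prior gives that off-diagonal entry a smaller weight, which widens the gap between the matching and non-matching score in row 0.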
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Contrastive Language-Image Pre-training · Linear Warmup With Linear Decay · Weight Decay · Adam · Dropout · Attention Dropout
