HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
Song Liu, Haoqi Fan, Shengsheng Qian, Yiru Chen, Wenkui Ding, Zhongyuan Wang

TL;DR
This paper introduces HiT, a hierarchical transformer with momentum contrast for improved video-text retrieval, leveraging multi-level contrastive learning and large-scale negative sample interactions to enhance retrieval accuracy.
Contribution
The paper proposes a novel Hierarchical Transformer with momentum contrast, enabling multi-view and comprehensive retrieval with large-scale negative sample interactions.
Findings
Outperforms existing methods on major benchmarks
Effective multi-level contrastive learning improves retrieval accuracy
Momentum contrast enables large-scale negative sample interaction
Abstract
Video-Text Retrieval has been a hot research topic with the growth of multimedia data on the internet. Transformer for video-text learning has attracted increasing attention due to its promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) Exploitation of the transformer architecture where different layers have different feature characteristics is limited; 2) End-to-end training mechanism limits negative sample interactions in a mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs Hierarchical Cross-modal Contrastive Matching in both feature-level and semantic-level, achieving multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale…
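The momentum-contrast idea mentioned in the abstract can be illustrated with a minimal sketch in the style of MoCo. It is not the authors' implementation: the function names, the queue size, the temperature, and the toy 2-D embeddings are all assumptions made for illustration. The excerpt does not specify HiT's encoders or hyperparameters. The sketch shows the three ingredients such a scheme generally needs: a slowly updated key encoder, a fixed-size queue of past keys that serve as negatives, and an InfoNCE-style loss computed against that queue.

```python
import math
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style update: key <- m * key + (1 - m) * query (m is an assumed value)."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Contrastive loss for one query against one positive and many negatives."""
    q = normalize(query)
    logits = [dot(q, normalize(positive_key)) / temperature]
    logits += [dot(q, normalize(k)) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp; the positive sits at index 0.
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


# Hypothetical memory bank of past text keys; the oldest entry drops out first.
text_queue = deque(maxlen=4)
for past_text_key in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    text_queue.append(past_text_key)

video_query = [1.0, 0.1]        # toy video embedding from the query encoder
matching_text_key = [0.9, 0.2]  # toy text embedding from the key encoder
loss = info_nce(video_query, matching_text_key, list(text_queue))
text_queue.append(matching_text_key)  # enqueue the current key for later steps
```

Because the queue is decoupled from the mini-batch, the number of negatives is set by `maxlen` rather than by the batch size, which is the property the abstract refers to as large-scale negative sample interaction. A symmetric text-to-video term with its own queue of video keys would follow the same pattern.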
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Softmax · Dense Connections · Attention Is All You Need · Dropout · Layer Normalization · Residual Connection
