HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

Song Liu; Haoqi Fan; Shengsheng Qian; Yiru Chen; Wenkui Ding; Zhongyuan Wang

arXiv:2103.15049·cs.CV·August 19, 2021·1 cites

Open Access

TL;DR

This paper introduces HiT, a hierarchical transformer with momentum contrast for improved video-text retrieval, leveraging multi-level contrastive learning and large-scale negative sample interactions to enhance retrieval accuracy.

Contribution

The paper proposes a novel Hierarchical Transformer with momentum contrast, enabling multi-view and comprehensive retrieval with large-scale negative sample interactions.

Findings

01. Outperforms existing methods on major benchmarks
02. Effective multi-level contrastive learning improves retrieval accuracy
03. Momentum contrast enables large-scale negative sample interaction

Abstract

Video-text retrieval has been a hot research topic with the growth of multimedia data on the internet. Transformers for video-text learning have attracted increasing attention due to their promising performance. However, existing cross-modal transformer approaches typically suffer from two major limitations: 1) they make limited use of the fact that different transformer layers have different feature characteristics; 2) the end-to-end training mechanism limits negative sample interactions to those within a mini-batch. In this paper, we propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval. HiT performs Hierarchical Cross-modal Contrastive Matching at both the feature level and the semantic level, achieving multi-view and comprehensive retrieval results. Moreover, inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale…
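The abstract is cut off before it describes Momentum Cross-modal Contrast in detail, so the following is only a minimal sketch of the generic MoCo-style mechanism it cites as inspiration, not the authors' implementation. It assumes a key encoder updated as an exponential moving average of the query encoder, a fixed-size queue of past key embeddings used as negatives, and an InfoNCE loss. The names (`momentum_update`, `info_nce`, `text_key_queue`), the momentum 0.999, the temperature 0.07, and the toy embeddings are all illustrative assumptions.

```python
import math
from collections import deque


def momentum_update(query_params, key_params, m=0.999):
    """EMA update of key-encoder weights: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss: -log softmax of the positive among positive + negatives."""
    q = l2_normalize(query)
    keys = [positive_key] + list(negative_keys)
    logits = [dot(q, l2_normalize(k)) / temperature for k in keys]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[0]


# Hypothetical queue of text-key embeddings kept from earlier mini-batches.
text_key_queue = deque(maxlen=4)
for old_key in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    text_key_queue.append(old_key)

video_query = [1.0, 0.1]  # toy video embedding from the query encoder
text_key = [0.9, 0.0]     # toy embedding of the paired caption from the key encoder

# Video-to-text loss: the paired caption is the positive, queued keys are negatives.
loss_v2t = info_nce(video_query, text_key, text_key_queue)

# After the step, the current key joins the queue (the oldest entry is evicted
# once the queue is full) and the key encoder drifts toward the query encoder.
text_key_queue.append(text_key)
key_weights = momentum_update(query_params=[0.5, -0.2], key_params=[0.4, -0.1])
```

In this sketch the queue decouples the number of negatives from the mini-batch size, which is the limitation the abstract attributes to end-to-end training. A text-to-video direction would mirror it with a queue of video keys.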

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Softmax · Dense Connections · Attention Is All You Need · Dropout · Layer Normalization · Residual Connection