Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

Rui Zhao; Kecheng Zheng; Zheng-jun Zha

arXiv:2004.04959 · cs.MM · April 13, 2020 · 1 citation

TL;DR

This paper introduces a stacked convolutional deep encoding network that captures both long-range and short-range dependencies in videos and texts for improved cross-modal video-text retrieval, leveraging multi-scale dilated convolutions and Transformer-based language modeling.

Contribution

It proposes a novel stacked convolutional architecture with multi-scale dilated convolutions and Transformer-based text encoding for better video-text retrieval performance.

Findings

01. Outperforms state-of-the-art methods on the MSR-VTT and MSVD datasets.

02. Effectively encodes long-range and short-range dependencies in videos and texts.

03. Demonstrates robustness of the proposed approach through extensive experiments.

Abstract

The dominant approaches to cross-modal video-text retrieval learn a joint embedding space in which cross-modal similarity is measured. However, these methods rarely explore long-range dependencies among video frames or textual words, which leads to insufficient textual and visual detail. In this paper, we propose a stacked convolutional deep encoding network for the video-text retrieval task that simultaneously encodes long-range and short-range dependencies in videos and texts. Specifically, a multi-scale dilated convolutional (MSDC) block encodes short-range temporal cues between video frames or text words by using convolutional layers with different kernel sizes and dilation rates. A stacked structure expands the receptive field by repeatedly applying the MSDC block, which further captures the long-range…
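To make the receptive-field argument concrete, here is a minimal pure-Python sketch of a multi-scale dilated block and of how stacking it widens temporal context. It is an illustration under assumptions, not the authors' implementation: the function names (`dilated_conv1d`, `msdc_block`, `receptive_field`), the averaging of branches, the residual addition, and the example kernel weights are all hypothetical choices that the abstract does not specify.

```python
def dilated_conv1d(seq, kernel, dilation):
    """Zero-padded, length-preserving 1D dilated convolution over a list of floats."""
    span = (len(kernel) - 1) * dilation   # distance between first and last tap
    pad_left = span // 2                  # centre the kernel on position t
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t - pad_left + j * dilation
            if 0 <= idx < len(seq):       # taps outside the sequence count as zero
                acc += w * seq[idx]
        out.append(acc)
    return out


def msdc_block(seq, branches):
    """Run parallel branches with different (kernel, dilation) pairs.

    Assumption: branches are fused by averaging and added back to the
    input (residual); the source abstract does not state the fusion rule.
    """
    outs = [dilated_conv1d(seq, k, d) for k, d in branches]
    fused = [sum(vals) / len(outs) for vals in zip(*outs)]
    return [x + f for x, f in zip(seq, fused)]


def receptive_field(branches, num_blocks):
    """Positions of the input that can influence one output after stacking."""
    widest = max((len(k) - 1) * d for k, d in branches)
    return 1 + num_blocks * widest


# Two hypothetical branches: kernel size 3 with dilation 1 and dilation 2.
branches = [([1 / 3] * 3, 1), ([1 / 3] * 3, 2)]
print(receptive_field(branches, 1))  # 5
print(receptive_field(branches, 3))  # 13
```

A single block only sees a few neighbouring frames or words (short-range cues), while each additional block adds the widest branch's span to the receptive field, which is the mechanism the abstract credits for reaching long-range dependencies.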

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax