Stacked Convolutional Deep Encoding Network for Video-Text Retrieval
Rui Zhao, Kecheng Zheng, Zheng-jun Zha

TL;DR
This paper introduces a stacked convolutional deep encoding network that captures both long-range and short-range dependencies in videos and texts for improved cross-modal video-text retrieval, leveraging multi-scale dilated convolutions and Transformer-based language modeling.
Contribution
It proposes a novel stacked convolutional architecture with multi-scale dilated convolutions and Transformer-based text encoding for better video-text retrieval performance.
Findings
Outperforms state-of-the-art methods on MSR-VTT and MSVD datasets.
Effectively encodes long-range and short-range dependencies in videos and texts.
Demonstrates robustness of the proposed approach through extensive experiments.
Abstract
The dominant approaches to cross-modal video-text retrieval learn a joint embedding space in which to measure cross-modal similarity. However, these methods rarely explore long-range dependencies among video frames or textual words, which leaves textual and visual details insufficiently captured. In this paper, we propose a stacked convolutional deep encoding network for video-text retrieval that simultaneously encodes long-range and short-range dependencies in videos and texts. Specifically, a multi-scale dilated convolutional (MSDC) block encodes short-range temporal cues between video frames or text words by using convolutional layers with different kernel sizes and dilation rates. A stacked structure expands the receptive field by repeatedly applying the MSDC block, which further captures the long-range…
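The interplay the abstract describes, parallel dilated convolutions for short-range cues and stacking for long-range reach, can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the branch settings in `BRANCHES`, the uniform kernel weights, the averaging of branch outputs, and the helper names `dilated_conv1d`, `msdc_block`, and `receptive_field` are all assumptions made for illustration, and a real model would use learned weights over feature vectors rather than scalars.

```python
def dilated_conv1d(seq, weights, dilation):
    """Same-length 1D dilated convolution over a list of floats (zero padding)."""
    k = len(weights)
    span = dilation * (k - 1)        # distance between the first and last kernel tap
    left = span // 2                 # centre the kernel so the output keeps the input length
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - left + j * dilation
            if 0 <= idx < len(seq):  # positions outside the sequence count as zero
                acc += w * seq[idx]
        out.append(acc)
    return out


# Hypothetical branch settings (kernel size, dilation); not taken from the paper.
BRANCHES = [(3, 1), (3, 2), (3, 4)]


def msdc_block(seq, branches=BRANCHES):
    """Run parallel dilated branches with uniform weights and average their outputs."""
    outs = [dilated_conv1d(seq, [1.0 / k] * k, d) for k, d in branches]
    return [sum(vals) / len(outs) for vals in zip(*outs)]


def receptive_field(num_blocks, branches=BRANCHES):
    """Positions seen by one output after stacking `num_blocks` blocks (stride 1)."""
    widest = max(d * (k - 1) for k, d in branches)
    return 1 + num_blocks * widest


# Feed a single impulse through the blocks and see how far it spreads.
impulse = [0.0] * 21
impulse[10] = 1.0
once = msdc_block(impulse)
twice = msdc_block(once)
print(receptive_field(1), receptive_field(2))
```

Under these assumed settings, one block lets each output position see 9 input positions and two stacked blocks extend that to 17, which is the sense in which repeated application widens the receptive field while each branch still captures local context.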
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
