Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions

Ke Ning; Linchao Zhu; Ming Cai; Yi Yang; Di Xie; Fei Wu

arXiv:1808.08803 · cs.CV · August 28, 2018

TL;DR

This paper introduces an attentive sequence to sequence translator (ASST) that localizes video clips from natural language descriptions. It combines a bi-directional RNN with a vision-language attention mechanism and a hierarchical architecture that jointly models the description and the video content.

Contribution

The paper makes two contributions: a bi-directional RNN with a vision-language attentive mechanism that attends each meaningful word or phrase to each frame, and a hierarchical architecture that jointly models language descriptions and video content for clip localization.

Findings

01 Outperforms state-of-the-art by 4.28% in Rank@1 on DiDeMo.

02 Achieves 13.41% improvement in Rank@1, IoU=0.5 on Charades-STA.

03 Demonstrates effective multi-granularity modeling of video content and language.
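
For readers unfamiliar with the "Rank@1, IoU=0.5" notation, the sketch below shows one common reading of it: a query counts as a hit when the top-ranked predicted clip overlaps the ground-truth clip with temporal intersection-over-union of at least 0.5. The function names (`temporal_iou`, `rank1_at_iou`) and the (start, end) interval format are assumptions made for illustration; the exact evaluation protocol for each dataset is not given in this excerpt and may differ.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals, e.g. in seconds."""
    # Overlap length, clamped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union length = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def rank1_at_iou(top1_preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold.

    Assumed reading of "Rank@1, IoU=threshold"; hypothetical helper,
    not the paper's evaluation code.
    """
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


# Two queries: the first prediction matches exactly, the second misses entirely.
preds = [(0.0, 10.0), (0.0, 10.0)]
truth = [(0.0, 10.0), (20.0, 30.0)]
score = rank1_at_iou(preds, truth, threshold=0.5)
```

Under this reading, a reported gain in "Rank@1, IoU=0.5" means a larger share of queries whose best-ranked clip clears the 0.5 overlap threshold.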

Abstract

We propose a novel attentive sequence to sequence translator (ASST) for clip localization in videos by natural language descriptions. We make two contributions. First, we propose a bi-directional Recurrent Neural Network (RNN) with a finely calibrated vision-language attentive mechanism to comprehensively understand the free-formed natural language descriptions. The RNN parses natural language descriptions in two directions, and the attentive model attends every meaningful word or phrase to each frame, thereby resulting in a more detailed understanding of video content and description semantics. Second, we design a hierarchical architecture for the network to jointly model language descriptions and video content. Given a video-description pair, the network generates a matrix representation, i.e., a sequence of vectors. Each vector in the matrix represents a video frame conditioned by…
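
To make the "each vector represents a video frame conditioned by" the description idea concrete, here is a minimal sketch of attending words to frames. It is not the paper's model: the dot-product scoring, the concatenation of frame and attended-word context, and the name `attend_words_to_frames` are all assumptions chosen for illustration, and the word vectors are taken as given rather than produced by a bi-directional RNN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_words_to_frames(frames, words):
    """Return one description-conditioned vector per frame.

    frames: list of T frame feature vectors (each of length d)
    words:  list of N word feature vectors (each of length d)
    Assumption: dot-product attention, then concatenation [frame; context].
    """
    dim = len(words[0])
    conditioned = []
    for frame in frames:
        # Score every word against this frame.
        scores = [sum(a * b for a, b in zip(frame, w)) for w in words]
        weights = softmax(scores)
        # Attention-weighted average of the word vectors.
        context = [sum(wt * w[d] for wt, w in zip(weights, words)) for d in range(dim)]
        conditioned.append(list(frame) + context)
    return conditioned


# Two toy frames and three toy word vectors, all 2-dimensional.
frames = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
matrix = attend_words_to_frames(frames, words)  # 2 rows, each of length 4
```

The result is a sequence of per-frame vectors, matching the abstract's description of a matrix representation for a video-description pair; how the paper builds and consumes that matrix is not recoverable from this excerpt.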

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling