Attentive Sequence to Sequence Translation for Localizing Clips of Interest by Natural Language Descriptions
Ke Ning, Linchao Zhu, Ming Cai, Yi Yang, Di Xie, Fei Wu

TL;DR
This paper introduces an attentive sequence-to-sequence model that localizes video clips from natural language descriptions. It combines a bi-directional RNN and a vision-language attention mechanism with a hierarchical architecture that jointly models the description and the video content.
Contribution
The paper makes two contributions: a bi-directional RNN with a vision-language attentive mechanism for parsing free-form descriptions, and a hierarchical architecture that jointly models language descriptions and video content for clip localization.
Findings
Outperforms the state of the art by 4.28% in Rank@1 on DiDeMo.
Achieves a 13.41% improvement in Rank@1 at IoU=0.5 on Charades-STA.
Demonstrates effective multi-granularity modeling of video content and language.
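The "Rank@1, IoU=0.5" figure above refers to a temporal localization metric. The following is a minimal sketch of how such a metric is commonly computed, assuming each prediction and ground truth is a (start, end) interval in seconds and that a query counts as correct when the top-ranked clip overlaps the ground truth with IoU at or above the threshold. The function names `temporal_iou` and `rank1_at_iou` are hypothetical, and this is not the authors' evaluation code; the DiDeMo protocol in particular may differ.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time intervals."""
    # Overlap length, clamped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union length is the sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def rank1_at_iou(top1_preds, gts, threshold=0.5):
    """Fraction of queries whose top-ranked clip reaches the IoU threshold.

    Assumption: top1_preds[i] is the highest-scoring clip for query i
    and gts[i] is its single ground-truth interval.
    """
    hits = sum(
        1 for p, g in zip(top1_preds, gts) if temporal_iou(p, g) >= threshold
    )
    return hits / len(gts)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (2.0, 8.0)]
    truths = [(0.0, 8.0), (5.0, 15.0)]
    print(rank1_at_iou(preds, truths, threshold=0.5))
```

In this toy input the first prediction overlaps its ground truth well and the second does not, so only one of the two queries counts as a hit.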
Abstract
We propose a novel attentive sequence to sequence translator (ASST) for clip localization in videos by natural language descriptions. We make two contributions. First, we propose a bi-directional Recurrent Neural Network (RNN) with a finely calibrated vision-language attentive mechanism to comprehensively understand the free-formed natural language descriptions. The RNN parses natural language descriptions in two directions, and the attentive model attends every meaningful word or phrase to each frame, thereby resulting in a more detailed understanding of video content and description semantics. Second, we design a hierarchical architecture for the network to jointly model language descriptions and video content. Given a video-description pair, the network generates a matrix representation, i.e., a sequence of vectors. Each vector in the matrix represents a video frame conditioned by…
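To make the idea of attending words to each frame concrete, here is a minimal sketch of generic dot-product attention, assuming frames and words are already embedded as vectors of the same dimension. In the paper the word representations would come from the bi-directional RNN; here plain vectors stand in for them. The function name `attend_words_per_frame` and the scoring function are illustrative assumptions, not the authors' formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_words_per_frame(frames, words):
    """For each frame vector, return an attention-weighted sum of word vectors.

    Assumption: dot-product scoring between a frame and each word.
    The result has one vector per frame, i.e. a sequence of
    description-conditioned vectors aligned with the video frames.
    """
    dim = len(words[0])
    conditioned = []
    for frame in frames:
        # Relevance of every word to this frame.
        scores = [sum(f * w for f, w in zip(frame, word)) for word in words]
        weights = softmax(scores)
        # Weighted sum of word vectors, one value per dimension.
        context = [
            sum(wt * word[d] for wt, word in zip(weights, words))
            for d in range(dim)
        ]
        conditioned.append(context)
    return conditioned


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]
    words = [[10.0, 0.0], [0.0, 10.0]]
    for row in attend_words_per_frame(frames, words):
        print(row)
```

Each frame ends up with its own summary of the description, dominated by the words it scores highest against, which is the per-frame conditioning the abstract describes at a high level.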
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
