To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression

Yitian Yuan; Tao Mei; Wenwu Zhu

arXiv:1804.07014 · cs.CV · November 6, 2018 · 27 cites

Open Access

TL;DR

This paper introduces an attention-based approach for localizing specific sentence descriptions within untrimmed videos by leveraging global context and multi-modal co-attention mechanisms, improving accuracy and efficiency.

Contribution

The paper presents a novel Attention Based Location Regression (ABLR) method that encodes video and sentence context, uses co-attention to highlight crucial details in both modalities, and predicts temporal boundaries end-to-end.

Findings

01 Outperforms existing methods on ActivityNet Captions and TACoS datasets.

02 Effectively captures global video structure and sentence details.

03 Demonstrates improved localization accuracy and computational efficiency.

Abstract

Given an untrimmed video and a sentence description, temporal sentence localization aims to automatically determine the start and end points of the described sentence within the video. The problem is challenging because it requires understanding both the video and the sentence. Existing research predominantly employs a costly "scan and localize" framework, neglecting the global video context and the specific details within sentences, both of which are critical for this problem. In this paper, we propose a novel Attention Based Location Regression (ABLR) approach to solve temporal sentence localization from a global perspective. Specifically, to preserve the context information, ABLR first encodes both video and sentence via Bidirectional LSTM networks. Then, a multi-modal co-attention mechanism is introduced to generate not only video attention which reflects the global video structure,…
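To make the idea of regressing a location from attention more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes the clip and sentence features have already been encoded (random vectors stand in for Bidirectional LSTM outputs), it uses a single sentence-guided attention pass instead of the full co-attention mechanism, and every function name, weight matrix, and layer size is hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sentence_guided_video_attention(video_feats, sent_vec, W_v, W_s, w):
    """Score each clip against the sentence and normalize over time.

    video_feats: (T, d) clip features (assumed to come from an encoder)
    sent_vec:    (d,)   sentence feature (assumed to come from an encoder)
    W_v, W_s:    (d, k) hypothetical projection matrices
    w:           (k,)   hypothetical scoring vector
    """
    hidden = np.tanh(video_feats @ W_v + sent_vec @ W_s)  # (T, k)
    scores = hidden @ w                                   # (T,)
    return softmax(scores)                                # sums to 1 over clips

def regress_location(attn, W1, b1, W2, b2):
    """Map attention weights to two normalized coordinates in (0, 1).

    The ReLU hidden layer is an assumption of this sketch. Nothing here
    forces start <= end; a trained model would have to learn that ordering.
    """
    h = np.maximum(0.0, attn @ W1 + b1)            # (hdim,)
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # sigmoid, shape (2,)

# Toy usage with random, untrained weights (illustration only).
rng = np.random.default_rng(0)
T, d, k, hdim = 16, 8, 4, 6
video_feats = rng.normal(size=(T, d))
sent_vec = rng.normal(size=d)
attn = sentence_guided_video_attention(
    video_feats, sent_vec,
    rng.normal(size=(d, k)), rng.normal(size=(d, k)), rng.normal(size=k),
)
loc = regress_location(
    attn,
    rng.normal(size=(T, hdim)), np.zeros(hdim),
    rng.normal(size=(hdim, 2)), np.zeros(2),
)
print("attention over clips:", np.round(attn, 3))
print("normalized (start, end) prediction:", np.round(loc, 3))
```

Because the weights are untrained, the printed coordinates are meaningless. The point is only the data flow: one forward pass over the whole video produces the boundaries directly, with no sliding-window scan over candidate segments.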

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory