Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

Jingwen Wang; Wenhao Jiang; Lin Ma; Wei Liu; Yong Xu

arXiv:1804.00100 · cs.CV · April 4, 2018 · 39 cites

TL;DR

This paper introduces a bidirectional proposal method and an attentive fusion approach with context gating for dense video captioning. Exploiting both past and future video context improves event localization, and the fused event representation yields more accurate and more distinguishable event descriptions.

Contribution

It proposes a novel bidirectional proposal mechanism and an attentive fusion scheme with context gating, improving dense video captioning performance over previous methods.
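The bidirectional proposal idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code (their TensorFlow implementation is the repository listed under Code & Models). It assumes a forward pass has already produced, for each time step t and each anchor length, a confidence that an event ends at t, and that a backward pass over the time-reversed features has produced confidences that an event starts at t. The helper names, the anchor lengths, and the choice to fuse the two confidences by multiplication are assumptions made for illustration.

```python
import numpy as np


def backward_to_forward_time(bwd_scores_reversed):
    """Re-index backward-pass scores into forward time (hypothetical helper).

    The backward pass ran on the reversed sequence, so a proposal "ending" at
    reversed step t' actually starts at forward step T - 1 - t'. Flipping the
    time axis maps row t' to row T - 1 - t'.
    """
    return bwd_scores_reversed[::-1]


def fuse_bidirectional_scores(fwd_scores, bwd_scores, anchor_lengths):
    """Combine forward and backward proposal confidences (illustrative sketch).

    fwd_scores[t, k]: confidence that an event of length anchor_lengths[k]
        ENDS at step t (forward pass, uses past context).
    bwd_scores[t, k]: confidence that an event of length anchor_lengths[k]
        STARTS at step t (backward pass, uses future context), in forward time.
    Returns (start, end, score) tuples with inclusive steps, best first.
    """
    num_steps = fwd_scores.shape[0]
    proposals = []
    for end in range(num_steps):
        for k, length in enumerate(anchor_lengths):
            start = end - length + 1
            if start < 0:  # anchor would extend before the first step
                continue
            # Assumed fusion rule: product of the two directional confidences.
            score = fwd_scores[end, k] * bwd_scores[start, k]
            proposals.append((start, end, float(score)))
    proposals.sort(key=lambda p: p[2], reverse=True)
    return proposals
```

Under the multiplicative rule, a candidate interval keeps a high score only when both the forward pass (looking back from the interval's end) and the backward pass (looking ahead from its start) agree on it.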

Findings

01. Outperforms the state of the art on the ActivityNet Captions dataset

02. Achieves over 100% relative improvement in METEOR score

03. Demonstrates the effectiveness of bidirectional context and attentive fusion
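Finding 02 can be read as a simple inequality. Writing M_prev for the previous best METEOR score and M_new for this paper's score (symbols introduced here for illustration; the actual values are reported in the paper, not on this page), a relative improvement above 100% means the score more than doubled:

```latex
\Delta_{\mathrm{rel}} = \frac{M_{\mathrm{new}} - M_{\mathrm{prev}}}{M_{\mathrm{prev}}} > 1
\quad\Longleftrightarrow\quad
M_{\mathrm{new}} > 2\,M_{\mathrm{prev}} \qquad (M_{\mathrm{prev}} > 0)
```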

Abstract

Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges in this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous work predominantly generates temporal event proposals in the forward direction only, which neglects future video context. We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. Second, in previous work, different events ending at (nearly) the same time are indistinguishable and therefore receive the same captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents…
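The abstract's second idea, representing each event by fusing proposal-module hidden states with the video content inside the event and gating the two, can be sketched in NumPy. This is a hedged illustration rather than the paper's exact formulation. The additive attention form, the use of the forward state at the event's end and the backward state at its start as the "context", the tanh projections, and a gate that weights visual content by (1 − g) and context by g are all assumptions. Every weight name (W_v, W_q, w_a, P_v, P_c, W_g) is hypothetical and randomly initialized.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attentive_fusion(clip_feats, h_fwd, h_bwd, dec_state, W_v, W_q, w_a):
    """Additive attention over the clip features inside one proposal.

    clip_feats: (n, d_v) visual features of the n clips in the proposal.
    h_fwd, h_bwd: (d_h,) proposal-module hidden states, assumed here to be the
        forward state at the event's end and the backward state at its start.
    dec_state: (d_s,) previous caption-decoder hidden state.
    Returns the attended visual feature, the proposal context, and the weights.
    """
    context = np.concatenate([h_fwd, h_bwd])        # (2*d_h,)
    query = np.concatenate([context, dec_state])    # (2*d_h + d_s,)
    # e_i = w_a^T tanh(W_v^T v_i + W_q^T q), one score per clip
    scores = np.tanh(clip_feats @ W_v + query @ W_q) @ w_a  # (n,)
    alpha = softmax(scores)                          # weights sum to 1
    attended = alpha @ clip_feats                    # (d_v,)
    return attended, context, alpha


def context_gate(visual, context, prev_word, dec_state, P_v, P_c, W_g):
    """Gate how much proposal context vs. visual content reaches the decoder."""
    v = np.tanh(visual @ P_v)    # project visual feature to (d,)
    c = np.tanh(context @ P_c)   # project proposal context to (d,)
    g = sigmoid(np.concatenate([v, c, prev_word, dec_state]) @ W_g)  # (d,) in (0, 1)
    # Large g lets more context through; small g favors visual content.
    return np.concatenate([(1.0 - g) * v, g * c])  # (2*d,) decoder input


rng = np.random.default_rng(0)
n, d_v, d_h, d_s, d_a, d_e, d = 5, 8, 4, 6, 7, 3, 6
clips = rng.standard_normal((n, d_v))
h_fwd, h_bwd = rng.standard_normal(d_h), rng.standard_normal(d_h)
dec_state, prev_word = rng.standard_normal(d_s), rng.standard_normal(d_e)
W_v = rng.standard_normal((d_v, d_a))
W_q = rng.standard_normal((2 * d_h + d_s, d_a))
w_a = rng.standard_normal(d_a)
P_v = rng.standard_normal((d_v, d))
P_c = rng.standard_normal((2 * d_h, d))
W_g = rng.standard_normal((2 * d + d_e + d_s, d))

visual, context, alpha = attentive_fusion(clips, h_fwd, h_bwd, dec_state, W_v, W_q, w_a)
decoder_input = context_gate(visual, context, prev_word, dec_state, P_v, P_c, W_g)
print(decoder_input.shape)  # (12,)
```

Two events that end at the same step share h_fwd, but in this sketch they still differ in h_bwd and in the clips attended over, so the decoder receives different inputs and need not produce identical captions.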

Code & Models

Repositories

JaywongWang/DenseVideoCaptioning (TensorFlow)

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization