Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, Yong Xu

TL;DR
This paper introduces a bidirectional proposal method and an attentive fusion approach with context gating for dense video captioning, significantly improving event localization and description accuracy by leveraging both past and future contexts.
Contribution
It proposes a novel bidirectional proposal mechanism and an attentive fusion with context gating, enhancing dense video captioning performance over previous methods.
Findings
Outperforms the previous state of the art on the ActivityNet Captions dataset
Achieves a relative improvement of over 100% in METEOR score
Demonstrates the effectiveness of bidirectional context and attentive fusion
Abstract
Dense video captioning is a newly emerging task that aims at both localizing and describing all events in a video. We identify and tackle two challenges on this task, namely, (1) how to utilize both past and future contexts for accurate event proposal predictions, and (2) how to construct informative input to the decoder for generating natural event descriptions. First, previous works predominantly generate temporal event proposals in the forward direction, which neglects future video context. We propose a bidirectional proposal method that effectively exploits both past and future contexts to make proposal predictions. Second, different events ending at (nearly) the same time are indistinguishable in the previous works, resulting in the same captions. We solve this problem by representing each event with an attentive fusion of hidden states from the proposal module and video contents…
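The bidirectional idea in the abstract can be illustrated with a minimal sketch. It assumes each candidate event receives one confidence from a forward pass (which has seen past context) and one from a backward pass (which has seen future context), and that the two are merged by multiplication. The merge rule, the function names (`fuse_scores`, `top_proposals`), and all numbers are illustrative assumptions, not details taken from the text above.

```python
from typing import Dict, List, Tuple

Proposal = Tuple[int, int]  # hypothetical (start, end) clip indices


def fuse_scores(forward: Dict[Proposal, float],
                backward: Dict[Proposal, float]) -> Dict[Proposal, float]:
    """Merge per-proposal confidences from both passes.

    Assumption: the merge is a simple product, so a proposal needs
    support from both past and future context to score highly.
    """
    return {p: forward[p] * backward[p] for p in forward if p in backward}


def top_proposals(fused: Dict[Proposal, float],
                  threshold: float) -> List[Tuple[Proposal, float]]:
    """Keep proposals at or above the threshold, best first."""
    kept = [(p, s) for p, s in fused.items() if s >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Made-up confidences for three candidate events.
forward = {(0, 4): 0.9, (2, 4): 0.8, (5, 9): 0.5}
backward = {(0, 4): 0.5, (2, 4): 0.9, (5, 9): 0.5}

fused = fuse_scores(forward, backward)
ranked = top_proposals(fused, threshold=0.4)
for proposal, score in ranked:
    print(proposal, round(score, 2))
```

In this toy input the forward pass slightly prefers (0, 4), but the backward pass favors (2, 4), so the fused ranking puts (2, 4) first and drops (5, 9) below the threshold.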
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
