Jointly Localizing and Describing Events for Dense Video Captioning

Yehao Li; Ting Yao; Yingwei Pan; Hongyang Chao; Tao Mei

arXiv:1804.08274·cs.CV·April 24, 2018·38 cites

TL;DR

This paper introduces a unified end-to-end framework for dense video captioning that jointly localizes events and generates descriptive sentences, improving accuracy and achieving new state-of-the-art results on ActivityNet Captions.

Contribution

It proposes a novel joint training approach combining event localization and captioning with descriptiveness regression for better dense video captioning.

Findings

01 Achieved a METEOR score of 12.96% on ActivityNet Captions, reported as a new state of the art.

02 Demonstrated clear improvements over existing methods.

03 Validated the effectiveness of joint optimization in dense video captioning.

Abstract

Automatically describing a video with natural language is regarded as a fundamental challenge in computer vision. The problem is nevertheless not trivial, especially when a video contains multiple events worthy of mention, which often happens in real videos. A valid question is how to temporally localize and then describe events, which is known as "dense video captioning." In this paper, we present a novel framework for dense video captioning that unifies the localization of temporal event proposals and the sentence generation for each proposal by jointly training them in an end-to-end manner. To combine these two worlds, we integrate a new design, namely descriptiveness regression, into a single shot detection structure to infer the descriptive complexity of each detected proposal via sentence generation. This in turn adjusts the temporal locations of each event proposal. Our model…
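
The abstract describes each event proposal as carrying both a localization signal and a descriptiveness signal inferred through sentence generation. The following is a minimal illustrative sketch of how two such signals could be combined to rank proposals. It is not the authors' method: the `Proposal` fields, the linear weighting in `rank_proposals`, the `weight` and `top_k` parameters, and all the numbers are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    start: float            # start time in seconds
    end: float              # end time in seconds
    eventness: float        # hypothetical localization confidence in [0, 1]
    descriptiveness: float  # hypothetical caption-derived score in [0, 1]


def rank_proposals(
    proposals: List[Proposal], weight: float = 0.5, top_k: int = 3
) -> List[Proposal]:
    """Rank proposals by a weighted sum of the two scores.

    The linear combination is an assumption for illustration only;
    the source does not specify how the scores are fused.
    """
    def combined(p: Proposal) -> float:
        return (1.0 - weight) * p.eventness + weight * p.descriptiveness

    # Highest combined score first, then keep the top_k proposals.
    return sorted(proposals, key=combined, reverse=True)[:top_k]


# Made-up proposals for a single video.
proposals = [
    Proposal(0.0, 12.5, eventness=0.9, descriptiveness=0.2),
    Proposal(3.0, 20.0, eventness=0.6, descriptiveness=0.9),
    Proposal(18.0, 30.0, eventness=0.4, descriptiveness=0.3),
]

ranked = rank_proposals(proposals, weight=0.5, top_k=2)
for p in ranked:
    print(f"[{p.start:.1f}s, {p.end:.1f}s]")
```

With `weight=0.0` the ranking falls back to the localization confidence alone, so the parameter gives a simple way to see how much a descriptiveness signal changes which segments are kept.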

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization