Jointly Localizing and Describing Events for Dense Video Captioning
Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, Tao Mei

TL;DR
This paper introduces a unified end-to-end framework for dense video captioning that jointly localizes events and generates descriptive sentences, improving accuracy and achieving new state-of-the-art results on ActivityNet Captions.
Contribution
It proposes a novel joint training approach combining event localization and captioning with descriptiveness regression for better dense video captioning.
Findings
Achieved a new state-of-the-art METEOR score of 12.96% on ActivityNet Captions.
Demonstrated clear improvements over existing dense video captioning methods on ActivityNet Captions.
Validated the effectiveness of joint optimization in dense video captioning.
Abstract
Automatically describing a video with natural language is regarded as a fundamental challenge in computer vision. The problem nevertheless is not trivial especially when a video contains multiple events to be worthy of mention, which often happens in real videos. A valid question is how to temporally localize and then describe events, which is known as "dense video captioning." In this paper, we present a novel framework for dense video captioning that unifies the localization of temporal event proposals and sentence generation of each proposal, by jointly training them in an end-to-end manner. To combine these two worlds, we integrate a new design, namely descriptiveness regression, into a single shot detection structure to infer the descriptive complexity of each detected proposal via sentence generation. This in turn adjusts the temporal locations of each event proposal. Our model…
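As a small illustration of what temporal localization involves, the overlap between a predicted event proposal and a reference segment is commonly measured with temporal intersection-over-union (tIoU). The sketch below is not taken from the paper; the `Segment` type and the `temporal_iou` name are hypothetical, and segments are assumed to be (start, end) pairs in seconds.

```python
from typing import Tuple

# Hypothetical representation: an event segment as (start_sec, end_sec).
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two segments on the time axis."""
    # Overlap length is zero when the segments are disjoint.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    # Union length = sum of both lengths minus the shared part.
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    proposal = (0.0, 10.0)   # illustrative values only
    reference = (5.0, 15.0)
    print(f"tIoU = {temporal_iou(proposal, reference):.3f}")
```

A proposal is typically counted as matching a reference event when its tIoU exceeds a chosen threshold; the threshold itself is a design choice, not something specified in the text above.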
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
