Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020
Teng Wang, Huicheng Zheng, Mingjing Yu

TL;DR
This technical report describes a two-stage dense video captioning system submitted to ActivityNet Challenge 2020: it first extracts temporal event proposals, then captions them with a model that captures event-level temporal relationships and fuses multi-modal information.
Contribution
The report introduces a multi-event captioning model that captures event-level temporal relationships and fuses multi-modal information for dense video captioning.
Findings
Achieved a 9.28 METEOR score on the test set.
Follows a two-stage pipeline: temporal event proposal extraction, then multi-event captioning.
The captioning model captures event-level temporal relationships and fuses multi-modal information.
Abstract
This technical report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020. Our approach follows a two-stage pipeline: first, we extract a set of temporal event proposals; then we propose a multi-event captioning model to capture the event-level temporal relationships and effectively fuse the multi-modal information. Our approach achieves a 9.28 METEOR score on the test set.
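The two-stage structure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Proposal`, `select_proposals`, `fuse_features`, and `dense_caption` are hypothetical, proposal selection is reduced to top-k by score, and fusion is reduced to plain concatenation. The report does not specify either of these details. The only idea carried over from the abstract is that the captioner sees the features of all events in the video, so each caption can depend on its neighbours in time.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Proposal:
    """A candidate event segment (hypothetical container, not from the report)."""
    start: float  # seconds
    end: float    # seconds
    score: float  # confidence from some proposal generator


def select_proposals(candidates: Sequence[Proposal], top_k: int) -> List[Proposal]:
    """Stage 1 stand-in: keep the top_k highest-scoring candidates in temporal order."""
    kept = sorted(candidates, key=lambda p: p.score, reverse=True)[:top_k]
    return sorted(kept, key=lambda p: (p.start, p.end))


def fuse_features(per_modality: Sequence[Sequence[float]]) -> List[float]:
    """Fusion stand-in: concatenate one feature vector per modality (an assumption)."""
    fused: List[float] = []
    for feats in per_modality:
        fused.extend(feats)
    return fused


def dense_caption(
    candidates: Sequence[Proposal],
    featurize: Callable[[Proposal], Sequence[Sequence[float]]],
    captioner: Callable[[int, List[List[float]]], str],
    top_k: int = 3,
) -> List[Tuple[Proposal, str]]:
    """Stage 2 stand-in: caption each event given the fused features of all events.

    `captioner(i, event_feats)` receives the index of the event to describe and
    the features of every selected event, so it can use event-level context.
    """
    proposals = select_proposals(candidates, top_k)
    event_feats = [fuse_features(featurize(p)) for p in proposals]
    return [(p, captioner(i, event_feats)) for i, p in enumerate(proposals)]
```

In use, `featurize` would return one vector per modality for a segment, and `captioner` would be a trained sequence model; both are placeholders here and can be swapped for simple stubs to check the data flow.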
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
