Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020

Teng Wang; Huicheng Zheng; Mingjing Yu

arXiv:2006.11693 · cs.CV · August 13, 2020 · 5 cites

TL;DR

This paper describes a two-stage dense video captioning system submitted to ActivityNet Challenge 2020: it first extracts temporal event proposals, then captions them with a model that captures event-level temporal relationships and fuses multi-modal information.

Contribution

It introduces a multi-event captioning model that captures event-level temporal relationships and fuses multi-modal information for dense video captioning.

Findings

01 Achieves a 9.28 METEOR score on the test set.

02 Follows a two-stage pipeline: temporal event proposal extraction, then multi-event captioning.

03 The captioning model captures event-level temporal relationships and fuses multi-modal information.

Abstract

This technical report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020. Our approach follows a two-stage pipeline: first, we extract a set of temporal event proposals; then we propose a multi-event captioning model to capture the event-level temporal relationships and effectively fuse the multi-modal information. Our approach achieves a 9.28 METEOR score on the test set.
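The two-stage control flow described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `Event`, `dense_caption`, `toy_propose`, and `toy_caption` are hypothetical and do not come from the report or the linked repository, and the two stub stages stand in for the real proposal and captioning models. The point it tries to show is that the second stage receives the whole ordered set of proposals at once, which is what would let a captioner use event-level temporal relationships.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Span = Tuple[float, float]            # (start, end) in seconds
Features = Sequence[Sequence[float]]  # one feature vector per time step


@dataclass
class Event:
    start: float
    end: float
    caption: str = ""


def dense_caption(
    features: Features,
    propose: Callable[[Features], List[Span]],
    caption_events: Callable[[Features, List[Span]], List[str]],
) -> List[Event]:
    """Run a two-stage pipeline: proposals first, then joint captioning."""
    # Stage 1: extract temporal event proposals and order them in time.
    spans = sorted(propose(features))
    # Stage 2: caption all events in one call, so the captioner can see
    # the full ordered set of events rather than one clip at a time.
    captions = caption_events(features, spans)
    if len(captions) != len(spans):
        raise ValueError("captioner must return one caption per proposal")
    return [Event(s, e, c) for (s, e), c in zip(spans, captions)]


# Hypothetical stand-ins for the two stages (assumption: 1 feature = 1 second).
def toy_propose(features: Features) -> List[Span]:
    n = float(len(features))
    return [(n / 2, n), (0.0, n / 2)]  # deliberately unsorted


def toy_caption(features: Features, spans: List[Span]) -> List[str]:
    return [f"event {i + 1} of {len(spans)}" for i in range(len(spans))]


events = dense_caption([[0.0]] * 10, toy_propose, toy_caption)
for ev in events:
    print(f"[{ev.start:.1f}s, {ev.end:.1f}s] {ev.caption}")
```

Passing the stages in as callables keeps the sketch independent of any particular proposal or captioning model; a real system would replace the two toy functions with trained networks.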

Code & Models

Repositories

ttengwang/dense-video-captioning-pytorch
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization