Best Vision Technologies Submission to ActivityNet Challenge 2018-Task: Dense-Captioning Events in Videos

Yuan Liu; Moyini Yao

arXiv:1806.09278 · cs.CV · June 26, 2018 · 1 cites

TL;DR

This paper presents a two-stage approach to dense video captioning for the ActivityNet Challenge 2018: temporal event proposal followed by LSTM-based sentence generation from RGB and optical flow inputs.

Contribution

It describes a dense-captioning pipeline that reuses the existing three-stage event proposal workflow of [13, 16] and pairs it with two LSTM captioning models with temporal attention (LSTM-T), fed RGB and optical flow inputs and combined by late fusion at inference.
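To make the temporal attention idea concrete, here is a minimal sketch of soft attention over per-frame features. It is an illustration only: the source does not give the LSTM-T scoring function, so dot-product scoring is an assumption, and the names `temporal_attention`, `frame_features`, and `hidden_state` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, hidden_state):
    """Weight each frame by its relevance to the current decoder state.

    frame_features: list of per-frame feature vectors (lists of floats).
    hidden_state:   decoder hidden state with the same dimensionality.
    Dot-product scoring is an assumption, not the paper's stated choice.
    """
    # One relevance score per frame.
    scores = [
        sum(f * h for f, h in zip(frame, hidden_state))
        for frame in frame_features
    ]
    # Normalize scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the frame features.
    dim = len(frame_features[0])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    weights, context = temporal_attention(frames, [1.0, 1.0])
    print(weights, context)
```

In a captioning decoder, a context vector of this kind would typically be recomputed at every word step, so that different words can attend to different frames of the proposed event.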

Findings

01

Temporal event proposals are taken directly from the three-stage workflow of [13, 16].

02

Sentences are generated by late fusion of two LSTM-T captioning models that use RGB and optical flow inputs.

03

The system was submitted to the dense-captioning events in videos task of the ActivityNet Challenge 2018.

Abstract

This note describes our solution to the dense-captioning events in videos task of the ActivityNet Challenge 2018. We solve the problem in two stages: temporal event proposal followed by sentence generation. For temporal event proposal, we directly adopt the three-stage workflow of [13, 16]. For sentence generation, we use an LSTM-based captioning framework with a temporal attention mechanism (dubbed LSTM-T). The input visual sequence to the LSTM-based video captioning model consists of RGB and optical flow images. At inference, we adopt a late fusion scheme that fuses the two LSTM-based captioning models for sentence generation.
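The late fusion step can be sketched as follows. This is a minimal illustration, assuming one captioning model per modality and assuming that fusion means a weighted average of the two models' next-word probability distributions at each decoding step. The abstract does not state the fusion rule or its weights, so `late_fuse`, `pick_next_word`, and `rgb_weight` are hypothetical.

```python
def late_fuse(rgb_probs, flow_probs, rgb_weight=0.5):
    """Blend two next-word distributions over the same vocabulary.

    rgb_probs, flow_probs: probabilities from the RGB and optical flow
    captioning models for the current decoding step.
    rgb_weight: share given to the RGB model (assumed, not from the source).
    """
    if len(rgb_probs) != len(flow_probs):
        raise ValueError("both models must score the same vocabulary")
    if not 0.0 <= rgb_weight <= 1.0:
        raise ValueError("rgb_weight must lie in [0, 1]")
    return [
        rgb_weight * p + (1.0 - rgb_weight) * q
        for p, q in zip(rgb_probs, flow_probs)
    ]


def pick_next_word(vocab, rgb_probs, flow_probs, rgb_weight=0.5):
    """Greedy choice of the next word from the fused distribution."""
    fused = late_fuse(rgb_probs, flow_probs, rgb_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return vocab[best]


if __name__ == "__main__":
    vocab = ["a", "man", "runs"]
    rgb = [0.6, 0.3, 0.1]    # toy output of the RGB model
    flow = [0.1, 0.2, 0.7]   # toy output of the optical flow model
    print(pick_next_word(vocab, rgb, flow))
```

Because the fused vector is a convex combination of two probability distributions, it still sums to 1, so it can be passed to greedy or beam-search decoding in place of a single model's output.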

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning