Best Vision Technologies Submission to ActivityNet Challenge 2018-Task: Dense-Captioning Events in Videos
Yuan Liu, Moyini Yao

TL;DR
This paper presents a two-stage approach for dense video captioning in ActivityNet Challenge 2018, combining temporal event proposal with LSTM-based sentence generation using RGB and optical flow inputs.
Contribution
It builds a dense-captioning framework that pairs an existing three-stage event proposal workflow with a dual-input LSTM captioning model using temporal attention and late fusion.
Findings
Effective temporal event proposals based on existing workflows.
Improved captioning accuracy through RGB and optical flow fusion.
Demonstrated success in ActivityNet Challenge 2018.
Abstract
This note describes our solution to the dense-captioning events in videos task of the ActivityNet Challenge 2018. We solve the problem in two stages: temporal event proposal followed by sentence generation. For temporal event proposal, we directly adopt the three-stage workflow of [13, 16]. For sentence generation, we build on an LSTM-based captioning framework with a temporal attention mechanism (dubbed LSTM-T). The input visual sequence to the captioning model comprises RGB and optical flow images, with one LSTM-based captioning model per modality. At inference, we adopt a late fusion scheme that combines the two models for sentence generation.
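The two mechanisms named in the abstract, temporal attention and late fusion, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how attention scores are computed or how the two models are fused, so the sketch assumes the relevance scores are already given and that fusion is a weighted average of per-word probabilities. All names (`temporal_attention`, `late_fusion`, `rgb_weight`) and the toy numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, relevance_scores):
    """Return an attention-weighted average of per-frame feature vectors.

    Assumption: relevance_scores (one per frame) are supplied by the caller.
    In an attention-based captioner they would typically be computed from
    the decoder's hidden state at the current step.
    """
    weights = softmax(relevance_scores)
    dim = len(frame_features[0])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frame_features))
        for d in range(dim)
    ]
    return context, weights


def late_fusion(rgb_probs, flow_probs, rgb_weight=0.5):
    """Fuse next-word distributions from the RGB and optical flow models.

    Assumption: fusion is a convex combination of the two distributions;
    rgb_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return [
        rgb_weight * p + (1.0 - rgb_weight) * q
        for p, q in zip(rgb_probs, flow_probs)
    ]


# Toy decoding step over a four-word vocabulary (illustrative values only).
vocab = ["a", "man", "runs", "jumps"]
rgb_probs = softmax([0.1, 0.2, 2.0, 1.0])   # stand-in for the RGB model output
flow_probs = softmax([0.0, 0.1, 1.0, 2.5])  # stand-in for the flow model output
fused = late_fusion(rgb_probs, flow_probs)
next_word = vocab[max(range(len(vocab)), key=fused.__getitem__)]
```

With equal weights, the fused distribution remains a valid probability distribution, so the same greedy or beam-search decoding used for a single model can be applied to it unchanged.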
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
