Integrating Temporal and Spatial Attentions for VATEX Video Captioning Challenge 2019

Shizhe Chen; Yida Zhao; Yuqing Song; Qin Jin; Qi Wu

arXiv:1910.06737·cs.CV·October 16, 2019

TL;DR

This paper introduces a model for video captioning that combines temporal and spatial attention mechanisms, achieving high performance in the VATEX challenge by effectively capturing actions and objects.

Contribution

Integrates temporal and spatial attention modules through late fusion to improve video captioning performance.

Findings

01. Achieved 73.4 CIDEr score on the VATEX test set

02. Ranked second in the VATEX 2019 challenge

03. Significantly outperformed baseline models

Abstract

This notebook paper presents our model for the VATEX video captioning challenge. To capture multi-level aspects of the video, we propose integrating temporal and spatial attention for video captioning. The temporal attentive module focuses on global action movements, while the spatial attentive module enables the model to describe more fine-grained objects. Because these two types of attentive modules are complementary, we fuse them via a late fusion strategy. The proposed model significantly outperforms baselines and achieves a CIDEr score of 73.4 on the test set, ranking second on the VATEX video captioning challenge leaderboard 2019.
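The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' model. It assumes dot-product soft attention over a small set of feature vectors (frames for the temporal branch, regions for the spatial branch) and assumes late fusion means a weighted average of the two branches' word distributions. The names `attend`, `late_fusion`, and `alpha`, and all toy values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Dot-product soft attention (an assumed scoring function).

    Scores each feature vector against the query, normalizes the scores
    with softmax, and returns the weighted sum (context) plus the weights.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, features))
        for d in range(dim)
    ]
    return context, weights


def late_fusion(prob_temporal, prob_spatial, alpha=0.5):
    """Weighted average of two word distributions (hypothetical fusion rule)."""
    return [
        alpha * pt + (1.0 - alpha) * ps
        for pt, ps in zip(prob_temporal, prob_spatial)
    ]


# Toy inputs: 3 frame-level features and 2 region-level features, 2-D each.
frame_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
region_feats = [[0.5, 0.5], [2.0, 0.0]]
query = [1.0, 0.0]  # stands in for a decoder hidden state

temporal_ctx, temporal_w = attend(query, frame_feats)
spatial_ctx, spatial_w = attend(query, region_feats)

# Hypothetical per-branch word distributions over a 3-word vocabulary.
p_temporal = softmax([2.0, 1.0, 0.0])
p_spatial = softmax([0.0, 1.0, 2.0])
p_fused = late_fusion(p_temporal, p_spatial)
```

In this sketch each branch would normally feed its own context vector into a separate caption decoder; the fusion step only combines the resulting per-word probabilities, which is why the two branches can be trained independently. Setting `alpha` away from 0.5 would favor one branch over the other.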

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization