Vatex Video Captioning Challenge 2020: Multi-View Features and Hybrid Reward Strategies for Video Captioning

Xinxin Zhu; Longteng Guo; Peng Yao; Shichen Lu; Wei Liu; Jing Liu

arXiv:1910.11102·cs.CV·June 25, 2020·1 cites

TL;DR

This paper presents a solution for the VATEX Video Captioning Challenge 2020, utilizing multi-view features, hybrid reward strategies, and ensemble methods to improve multilingual video captioning performance.

Contribution

It introduces an improved multi-view feature integration and hybrid reward approach, building upon previous methods to enhance captioning accuracy in both English and Chinese.

Findings

01

Achieved significant performance improvements over the authors' VATEX 2019 challenge method

02

Demonstrated effectiveness of multi-view features and hybrid reward strategies

03

Secured competitive results on both language tracks

Abstract

This report describes our solution for the VATEX Captioning Challenge 2020, which requires generating descriptions of videos in both English and Chinese. We identified three crucial factors that improve performance: multi-view features, a hybrid reward, and a diverse ensemble. Building on our method for the VATEX 2019 challenge, we achieved significant improvements this year through more advanced model architectures, a combination of appearance and motion features, and careful hyper-parameter tuning. Our method achieves very competitive results on both the Chinese and English video captioning tracks.
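The abstract names a hybrid reward but does not define it. As an illustration only, the sketch below assumes one common reading: the reward for a caption is a weighted sum of several caption-metric scores, and training uses the difference between a sampled caption's reward and a baseline caption's reward, as in self-critical sequence training. The metric names, the weights, and the functions `hybrid_reward` and `self_critical_advantages` are hypothetical and not taken from the report.

```python
from typing import Dict, List


def hybrid_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-metric scores for a single caption.

    Both the metric names and the weights are assumptions for this sketch;
    the report does not state which metrics are mixed or how.
    """
    return sum(weight * scores[name] for name, weight in weights.items())


def self_critical_advantages(
    sampled: List[Dict[str, float]],
    baseline: List[Dict[str, float]],
    weights: Dict[str, float],
) -> List[float]:
    """Reward of each sampled caption minus the reward of its baseline caption.

    A positive value means the sampled caption scored higher than the baseline
    (for example, a greedily decoded caption) under the mixed reward.
    """
    return [
        hybrid_reward(s, weights) - hybrid_reward(b, weights)
        for s, b in zip(sampled, baseline)
    ]


# Hypothetical metric scores for one video; real values would come from
# a caption-evaluation toolkit.
weights = {"cider": 1.0, "bleu4": 0.5}
sampled_scores = [{"cider": 0.8, "bleu4": 0.4}]
baseline_scores = [{"cider": 0.6, "bleu4": 0.2}]
advantages = self_critical_advantages(sampled_scores, baseline_scores, weights)
```

In a policy-gradient setup, each advantage would scale the log-probability of the corresponding sampled caption, so mixing metrics in the reward lets the model trade off several evaluation criteria rather than optimizing one alone.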

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization