Vatex Video Captioning Challenge 2020: Multi-View Features and Hybrid Reward Strategies for Video Captioning
Xinxin Zhu, Longteng Guo, Peng Yao, Shichen Lu, Wei Liu, Jing Liu

TL;DR
This paper presents a solution for the VATEX Video Captioning Challenge 2020, utilizing multi-view features, hybrid reward strategies, and ensemble methods to improve multilingual video captioning performance.
Contribution
It introduces improved multi-view feature integration and a hybrid reward approach, building on the authors' VATEX 2019 method to enhance captioning accuracy in both English and Chinese.
Findings
Achieved significant performance improvements over the authors' VATEX 2019 entry
Demonstrated effectiveness of multi-view features and hybrid reward strategies
Secured competitive results on both language tracks
Abstract
This report describes our solution for the VATEX Captioning Challenge 2020, which requires generating descriptions for videos in both English and Chinese. We identified three crucial factors that improve performance: multi-view features, hybrid reward, and diverse ensemble. Building on our method for the VATEX 2019 challenge, we achieved significant improvements this year with more advanced model architectures, a combination of appearance and motion features, and careful hyper-parameter tuning. Our method achieves very competitive results on both the Chinese and English video captioning tracks.
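The abstract names a hybrid reward but does not say how it is formed. One common reading in captioning work is a weighted mix of caption-metric scores, used as a reinforcement-learning reward measured against a greedy-decoding baseline (self-critical training). The sketch below illustrates that reading only; the metric names, the weights, and the functions `hybrid_reward` and `self_critical_advantages` are hypothetical and are not taken from the report.

```python
from typing import Dict, List


def hybrid_reward(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-metric caption scores (assumed form of a hybrid reward)."""
    return sum(weight * scores[name] for name, weight in weights.items())


def self_critical_advantages(
    sampled: List[Dict[str, float]],
    baseline: List[Dict[str, float]],
    weights: Dict[str, float],
) -> List[float]:
    """Reward of each sampled caption minus the reward of its greedy baseline caption."""
    return [
        hybrid_reward(s, weights) - hybrid_reward(b, weights)
        for s, b in zip(sampled, baseline)
    ]


# Hypothetical weights and metric scores, for illustration only.
weights = {"cider": 1.0, "bleu4": 0.5}
sampled_scores = [{"cider": 0.8, "bleu4": 0.4}]
greedy_scores = [{"cider": 0.6, "bleu4": 0.2}]

advantages = self_critical_advantages(sampled_scores, greedy_scores, weights)
```

A positive advantage means the sampled caption scored higher than the greedy one under the mixed metric, so a policy-gradient update would raise its probability. How the authors actually weight or combine their rewards may differ.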
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
