Fine-grained Video Attractiveness Prediction Using Multimodal Deep Learning on a Large Real-world Dataset
Xinpeng Chen, Jingyuan Chen, Lin Ma, Jian Yao, Wei Liu, Jiebo Luo, Tong Zhang

TL;DR
This paper introduces the first large-scale, fine-grained video attractiveness dataset and develops multimodal deep learning models to predict viewer engagement at the segment level, demonstrating the importance of visual and audio features.
Contribution
The paper creates a novel, large-scale dataset for fine-grained video attractiveness prediction and proposes multimodal sequential models that leverage visual and audio data.
Findings
Multimodal models outperform single-modality models.
Visual and audio features are both essential for accurate prediction.
Models effectively capture different viewer engagement behaviors.
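To make the idea of a multimodal sequential model concrete, here is a minimal sketch, not the authors' architecture. It assumes each video segment already has a visual and an audio feature vector, fuses them by concatenation (early fusion), and runs a plain tanh recurrence over the segments to emit one attractiveness score per segment. The function names, dimensions, and random untrained weights are all hypothetical and chosen only for illustration.

```python
import math
import random


def fuse(visual, audio):
    """Early fusion (an assumed strategy): concatenate the visual and audio
    feature vectors of each segment into one vector."""
    return [v + a for v, a in zip(visual, audio)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def rnn_scores(segments, w_in, w_rec, w_out):
    """Run a simple tanh recurrence over the segment sequence and return
    one scalar score per segment."""
    hidden = [0.0] * len(w_rec)
    scores = []
    for x in segments:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        scores.append(sum(w * h for w, h in zip(w_out, hidden)))
    return scores


def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these from data."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy inputs: 5 segments, 4-dim visual and 3-dim audio features (hypothetical sizes).
rng = random.Random(0)
num_segments, visual_dim, audio_dim, hidden_dim = 5, 4, 3, 6
visual = [[rng.random() for _ in range(visual_dim)] for _ in range(num_segments)]
audio = [[rng.random() for _ in range(audio_dim)] for _ in range(num_segments)]

fused = fuse(visual, audio)
w_in = random_matrix(hidden_dim, visual_dim + audio_dim, rng)
w_rec = random_matrix(hidden_dim, hidden_dim, rng)
w_out = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]

scores = rnn_scores(fused, w_in, w_rec, w_out)
print(scores)  # one untrained score per segment
```

Dropping either modality in this sketch only changes the input width, which is one simple way to set up the single-modality versus multimodal comparison described in the findings.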
Abstract
Nowadays, billions of videos are online ready to be viewed and shared. Among an enormous volume of videos, some popular ones are widely viewed by online users while the majority attract little attention. Furthermore, within each video, different segments may attract significantly different numbers of views. This phenomenon leads to a challenging yet important problem, namely fine-grained video attractiveness prediction. However, one major obstacle for such a challenging problem is that no suitable benchmark dataset currently exists. To this end, we construct the first fine-grained video attractiveness dataset (FVAD), which is collected from one of the most popular video websites in the world. In total, the constructed FVAD consists of 1,019 drama episodes with 780.6 hours covering different categories and a wide variety of video contents. Apart from the large number of videos, hundreds of…
