Video Affective Effects Prediction with Multi-modal Fusion and Shot-Long Temporal Context
Jie Zhang, Yin Zhao, Longjun Cai, Chaoping Tu, Wu Wei

TL;DR
This paper introduces a multi-modal fusion framework with shot-long temporal context modeling for predicting the emotional impact of videos (valence and arousal), and reports better performance than existing methods on the LIRIS-ACCEDE dataset.
Contribution
The paper proposes a comprehensive framework with modality-specific feature extraction, two-scale temporal structures, and a residual-based progressive fusion strategy for emotion prediction.
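One way to read "residual-based progressive fusion" is as a stage-wise scheme: a first modality is fitted to the target, and each later modality is fitted only to the error left by the earlier ones, with the final prediction being the sum of all stages. The sketch below illustrates that reading only. The linear regressors, the synthetic "audio" and "visual" features, and the function names `fit_linear`, `progressive_residual_fusion`, and `predict` are all assumptions made for illustration; the paper's actual models and training procedure are not described in this summary.

```python
import numpy as np

def fit_linear(X, y):
    """Fit a least-squares linear regressor with a bias term; return a predict function."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((Z.shape[0], 1))]) @ w

def progressive_residual_fusion(modalities, y):
    """Fit one regressor per modality, in order, each on the residual left by earlier stages.

    Hypothetical illustration: the stage models here are linear, which is an assumption.
    """
    models, train_mse = [], []
    residual = y.astype(float).copy()
    for X in modalities:
        model = fit_linear(X, residual)
        residual = residual - model(X)  # what is still unexplained after this stage
        models.append(model)
        train_mse.append(float(np.mean(residual ** 2)))
    return models, train_mse

def predict(models, modalities):
    """Fused prediction is the sum of every stage's output."""
    return sum(m(X) for m, X in zip(models, modalities))

# Synthetic stand-ins for per-clip features (not real LIRIS-ACCEDE data).
rng = np.random.default_rng(0)
n = 200
audio = rng.normal(size=(n, 4))
visual = rng.normal(size=(n, 6))
arousal = (audio @ rng.normal(size=4)
           + 0.5 * (visual @ rng.normal(size=6))
           + 0.1 * rng.normal(size=n))

models, train_mse = progressive_residual_fusion([audio, visual], arousal)
fused = predict(models, [audio, visual])
```

Because each stage can always fall back to predicting zero, the training error in `train_mse` cannot increase as modalities are added in this linear sketch; whether that carries over to held-out data depends on the models and the data.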
Findings
Achieved superior performance on the LIRIS-ACCEDE dataset.
Effectively models intra- and inter-clip temporal dependencies.
Enhances multi-modal fusion with residual-based training.
Abstract
Predicting the emotional impact of videos using machine learning is a challenging task, considering the variety of modalities, the complicated temporal context of the video, and the time dependency of the emotional states. Feature extraction, multi-modal fusion and temporal context fusion are crucial stages for predicting valence and arousal values in the emotional impact, but have not been successfully exploited. In this paper, we propose a comprehensive framework with novel designs of modal structure and multi-modal fusion strategy. We select the most suitable modalities for the valence and arousal tasks respectively, and each modal feature is extracted using a modality-specific deep model pre-trained on a large generic dataset. Two-time-scale structures, one for the intra-clip and the other for the inter-clip, are proposed to capture the temporal dependency of video content and…
Taxonomy
Topics: Human Pose and Action Recognition · Emotion and Mood Recognition · Video Analysis and Summarization
