TL;DR
This study develops multimodal deep learning models that combine video and audio features to predict viewers' emotional responses to movies. It finds that audio features are more predictive than video features, and that optical flow features are more informative than RGB frames.
Contribution
Introduces hybrid multimodal models that use both visual and audio features, and compares sequential and non-sequential neural network approaches for emotion prediction.
Findings
Audio features outperform video features in emotion prediction.
Optical flow features are more informative than RGB frames.
Predicting emotions independently per time step slightly outperforms LSTM-based sequential models.
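The difference between the two prediction styles in the last finding can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, layer sizes, random weights, and linear read-out are all hypothetical and untrained, chosen only to show how information flows. The per-step predictor scores each time step on its own, while the LSTM carries a hidden state and cell state from one step to the next.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_independent(features, W, b):
    """Classify each time step on its own; no information flows between steps."""
    logits = features @ W + b              # shape (T, n_classes)
    return logits.argmax(axis=1)

def lstm_hidden_states(features, Wx, Wh, b):
    """Run a single-layer LSTM over the sequence and return every hidden state."""
    T = features.shape[0]
    H = Wh.shape[0]                        # hidden size
    h = np.zeros(H)                        # hidden state
    c = np.zeros(H)                        # cell state
    states = np.zeros((T, H))
    for t in range(T):
        gates = features[t] @ Wx + h @ Wh + b   # shape (4H,)
        i = sigmoid(gates[:H])             # input gate
        f = sigmoid(gates[H:2 * H])        # forget gate
        o = sigmoid(gates[2 * H:3 * H])    # output gate
        g = np.tanh(gates[3 * H:])         # candidate cell update
        c = f * c + i * g
        h = o * np.tanh(c)
        states[t] = h
    return states

# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
T, D, H, n_classes = 6, 5, 4, 3
features = rng.normal(size=(T, D))

W_ind, b_ind = rng.normal(size=(D, n_classes)), np.zeros(n_classes)
independent_pred = predict_independent(features, W_ind, b_ind)

Wx = rng.normal(size=(D, 4 * H))
Wh = rng.normal(size=(H, 4 * H))
b_lstm = np.zeros(4 * H)
states = lstm_hidden_states(features, Wx, Wh, b_lstm)
W_out = rng.normal(size=(H, n_classes))
sequential_pred = (states @ W_out).argmax(axis=1)
```

Reversing the order of the time steps simply reverses the per-step predictions, whereas the LSTM output at each step depends on everything that came before it.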
Abstract
The goal of this study is to develop and analyze multimodal models for predicting experienced affective responses of viewers watching movie clips. We develop hybrid multimodal prediction models based on both the video and audio of the clips. For the video content, we hypothesize that both image content and motion are crucial features for evoked emotion prediction. To capture such information, we extract features from RGB frames and optical flow using pre-trained neural networks. For the audio model, we compute an enhanced set of low-level descriptors including intensity, loudness, cepstrum, linear predictor coefficients, pitch and voice quality. Both visual and audio features are then concatenated to create audio-visual features, which are used to predict the evoked emotion. To classify the movie clips into the corresponding affective response categories, we propose two approaches based…
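The concatenation step described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, feature dimensions, and random inputs are assumptions, and it presumes that the RGB, optical flow, and audio features have already been extracted and aligned to the same time steps.

```python
import numpy as np

def fuse_features(rgb, flow, audio):
    """Concatenate per-time-step features from each modality along the feature axis."""
    if not (rgb.shape[0] == flow.shape[0] == audio.shape[0]):
        raise ValueError("all modalities must share the same number of time steps")
    return np.concatenate([rgb, flow, audio], axis=1)

# Hypothetical sizes; real pre-trained network features are typically much larger.
rng = np.random.default_rng(0)
T = 8                                   # number of time steps in a clip
rgb = rng.normal(size=(T, 16))          # stand-in for RGB frame features
flow = rng.normal(size=(T, 16))         # stand-in for optical flow features
audio = rng.normal(size=(T, 10))        # stand-in for audio low-level descriptors

fused = fuse_features(rgb, flow, audio) # shape (T, 16 + 16 + 10)
```

Each row of `fused` is one audio-visual feature vector that a downstream classifier could consume, either one step at a time or as a sequence.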
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
