TL;DR
AttendAffectNet introduces self-attention-based neural networks that integrate audio and visual features to predict emotion from movies, outperforming prior models on the extended COGNIMUSE dataset and the MediaEval 2016 Emotional Impact of Movies Task.
Contribution
The paper presents novel self-attention architectures that effectively combine multi-modal features for emotion prediction in movies, demonstrating superior performance over prior methods.
Findings
Self-attention on audio-visual features outperforms temporal self-attention.
Proposed models outperform state-of-the-art emotion prediction methods.
Effective multi-modal integration enhances emotion recognition accuracy.
Abstract
In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relation among multiple modalities by applying the self-attention mechanism in a novel manner to the extracted features for emotion prediction. We compare it to the typical temporal integration of the self-attention based model, which, in our case, captures the relation among temporal representations of the movie while considering the sequential dependencies of emotion responses. We demonstrate the effectiveness of our proposed architectures on the extended COGNIMUSE dataset [1], [2] and the MediaEval 2016 Emotional Impact of Movies Task [3], which consist of movies with emotion annotations. Our results show that applying the self-attention mechanism on the different…
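The two integration strategies contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes identity query/key/value projections, a single attention head, mean pooling, and made-up 4-dimensional feature vectors that have already been projected to a common size. The names `self_attention`, `mean_pool`, and `clip_steps` are hypothetical. In the feature-level variant the attention tokens are the modality features of one movie segment; in the temporal variant the tokens are the fused representations of successive segments.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Scaled dot-product self-attention.
    # Assumption: identity Q/K/V projections (no learned weights).
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens)) for i in range(d)])
    return outputs, weights

def mean_pool(vectors):
    # Average a list of equal-length vectors element-wise.
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Made-up (audio, visual) feature vectors for three consecutive movie segments.
clip_steps = [
    ([0.2, 0.1, 0.4, 0.3], [0.5, 0.0, 0.1, 0.2]),
    ([0.3, 0.2, 0.1, 0.0], [0.4, 0.1, 0.2, 0.3]),
    ([0.0, 0.5, 0.2, 0.1], [0.1, 0.1, 0.6, 0.2]),
]

# Feature-level variant: the tokens are the modalities of a single segment.
audio, visual = clip_steps[0]
modality_out, modality_weights = self_attention([audio, visual])
fused = mean_pool(modality_out)

# Temporal variant: the tokens are per-segment fused vectors, ordered in time.
sequence = [mean_pool([a, v]) for a, v in clip_steps]
temporal_out, temporal_weights = self_attention(sequence)
```

Here `modality_weights` is a 2×2 matrix describing how much the audio and visual tokens attend to each other, while `temporal_weights` is a 3×3 matrix over segments. A real model would add learned projections, multiple heads, and a regression head for the emotion scores.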
