TL;DR
This paper presents a deep learning approach that combines a CNN with an RNN for emotion recognition in videos, evaluates it on large datasets such as Aff-Wild2, and reduces overfitting through dataset balancing.
Contribution
The study introduces a CNN-RNN model for dimensional emotion prediction on video data, analyzes how each network contributes to overall performance, and improves results through dataset balancing.
Findings
A CNN extracts effective feature vectors from individual video frames.
An RNN captures temporal dynamics across frames, which improves emotion prediction.
Balancing the dataset reduces overfitting and improves accuracy.
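The summary does not say which balancing technique the authors used. As one common option for the classification setting, the sketch below applies random oversampling: minority-class samples are duplicated until every class matches the largest one. The function name `oversample`, the toy frame identifiers, and the three emotion labels are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes
    have as many samples as the largest class.

    Illustrative only: the paper's actual balancing method is not
    described in this summary.
    """
    rng = random.Random(seed)

    # Group samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        # Draw extra samples (with replacement) to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((sample, label) for sample in items + extra)

    rng.shuffle(balanced)
    return balanced


# Hypothetical, heavily imbalanced toy data.
frames = [f"frame_{i}" for i in range(10)]
labels = ["neutral"] * 7 + ["happy"] * 2 + ["sad"]

balanced = oversample(frames, labels)
counts = Counter(label for _, label in balanced)
print(counts)
```

Oversampling of this kind is normally applied to the training split only, so that duplicated samples do not leak into validation or test data.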
Abstract
For many years, emotion recognition has remained one of the most interesting and important problems in the field of human-computer interaction. In this study, we treat emotion recognition as both a classification task and a regression task by processing encoded emotions from different datasets with deep learning models. Our model combines a convolutional neural network (CNN) with a recurrent neural network (RNN) to predict dimensional emotions on video data. In the first step, the CNN extracts feature vectors from video frames. In the second step, we feed these feature vectors to the RNN to exploit the temporal dynamics of the video. Furthermore, we analyze how each neural network contributes to the system's overall performance. The experiments are performed on publicly available datasets, including Aff-Wild2, the largest of the modern databases. It contains over sixty hours of video…
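The two-step pipeline described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the authors' model: `extract_features` is a hypothetical stand-in for the CNN (it just averages each pixel row), `TinyRNN` is a minimal Elman-style recurrent network with random, untrained weights, and the two outputs are assumed to be valence and arousal scaled to [-1, 1].

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rng, rows, cols):
    """Small random weight matrix as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def extract_features(frame):
    """Step 1 (stand-in for the CNN): map one frame to a feature vector.

    Assumption: a real system would run a trained CNN here. This
    placeholder only averages each row of pixel intensities.
    """
    return [sum(row) / len(row) for row in frame]


class TinyRNN:
    """Step 2: Elman-style RNN that reads per-frame features in order
    and emits two values per frame (assumed: valence, arousal)."""

    def __init__(self, n_in, n_hidden, n_out=2, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        self.w_in = random_matrix(rng, n_hidden, n_in)
        self.w_rec = random_matrix(rng, n_hidden, n_hidden)
        self.w_out = random_matrix(rng, n_out, n_hidden)

    def forward(self, sequence):
        hidden = [0.0] * self.n_hidden
        outputs = []
        for features in sequence:
            # The new hidden state depends on the current frame and on
            # the previous hidden state: this is the temporal context.
            hidden = [
                math.tanh(dot(w_x, features) + dot(w_h, hidden))
                for w_x, w_h in zip(self.w_in, self.w_rec)
            ]
            # tanh keeps each prediction inside [-1, 1].
            outputs.append([math.tanh(dot(w, hidden)) for w in self.w_out])
        return outputs


# Hypothetical toy "video": 5 frames of 4x4 grayscale intensities in [0, 1].
rng = random.Random(42)
video = [[[rng.random() for _ in range(4)] for _ in range(4)] for _ in range(5)]

features = [extract_features(frame) for frame in video]
model = TinyRNN(n_in=4, n_hidden=8)
predictions = model.forward(features)

for t, (valence, arousal) in enumerate(predictions):
    print(f"frame {t}: valence={valence:+.3f} arousal={arousal:+.3f}")
```

Because the weights are untrained, the printed values carry no meaning; the sketch only shows how per-frame features flow through a recurrent state to produce one prediction per frame.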
