TL;DR
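This paper presents a two-stage late-fusion method for dimensional speech emotion recognition: acoustic and text features are modeled by separate deep learning systems, and an SVM fuses their predictions into the final regression score. The method is reported to outperform single-modality models and early-fusion methods.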
Contribution
It introduces a two-stage late-fusion framework that trains acoustic and text models separately and then combines their predictions with an SVM to improve emotion recognition.
Findings
Outperforms single-modality models at predicting dimensional emotion scores.
Achieves higher concordance correlation coefficients than early fusion methods.
Demonstrates the effectiveness of late fusion in dimensional emotion modeling.
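For readers unfamiliar with the metric named in the findings, the concordance correlation coefficient (CCC) between reference scores and predictions is commonly defined as 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population (divide-by-n) statistics. The following is a minimal sketch of that standard definition; the function name `ccc` is illustrative, and nothing here is taken from the paper's own evaluation code.

```python
def ccc(y_true, y_pred):
    """Concordance correlation coefficient with population statistics."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n

    # Population variance and covariance (divide by n, not n - 1).
    var_true = sum((t - mean_true) ** 2 for t in y_true) / n
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred) / n
    cov = sum((t - mean_true) * (p - mean_pred)
              for t, p in zip(y_true, y_pred)) / n

    denominator = var_true + var_pred + (mean_true - mean_pred) ** 2
    if denominator == 0:
        # Both inputs are constant and identical; the ratio is undefined.
        raise ValueError("CCC is undefined for identical constant inputs")
    return 2 * cov / denominator
```

Unlike Pearson correlation, CCC penalizes a systematic offset: predictions that track the reference perfectly but are shifted by a constant score below 1.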
Abstract
Automatic speech emotion recognition (SER) by a computer is a critical component for more natural human-machine interaction. As in human-human interaction, the capability to perceive emotion correctly is essential to take further steps in a particular situation. One issue in SER is whether it is necessary to combine acoustic features with other data such as facial expressions, text, and motion capture. This research proposes to combine acoustic and text information by applying a late-fusion approach consisting of two steps. First, acoustic and text features are trained separately in deep learning systems. Second, the prediction results from the deep learning systems are fed into a support vector machine (SVM) to predict the final regression score. Furthermore, the task in this research is dimensional emotion modeling because it can enable a deeper analysis of affective states.…
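The two steps described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the data is synthetic, a small scikit-learn `MLPRegressor` stands in for each deep learning system, only one emotion dimension is predicted, and the feature sizes, split sizes, and hyperparameters are arbitrary.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-ins for per-utterance acoustic and text feature vectors.
acoustic = rng.normal(size=(n, 20))
text = rng.normal(size=(n, 30))
# Synthetic score for a single emotion dimension (assumption for the demo).
target = 0.6 * acoustic[:, 0] + 0.4 * text[:, 0] + 0.1 * rng.normal(size=n)

# Separate partitions for stage 1, stage 2, and final evaluation.
stage1_idx, stage2_idx, test_idx = np.split(rng.permutation(n), [150, 225])

# Stage 1: train one model per modality, independently of the other.
acoustic_model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500,
                              random_state=0)
acoustic_model.fit(acoustic[stage1_idx], target[stage1_idx])
text_model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=500,
                          random_state=0)
text_model.fit(text[stage1_idx], target[stage1_idx])


def stack_predictions(idx):
    """Return an (n_samples, 2) array of per-modality predictions."""
    return np.column_stack([
        acoustic_model.predict(acoustic[idx]),
        text_model.predict(text[idx]),
    ])


# Stage 2: a support vector regressor fuses the two prediction streams.
fusion = SVR(kernel="rbf", C=1.0)
fusion.fit(stack_predictions(stage2_idx), target[stage2_idx])

final_scores = fusion.predict(stack_predictions(test_idx))
```

Fitting the fusion regressor on utterances that the stage-1 models did not see is a design choice of this sketch, made so that the second stage learns from realistic rather than overfit predictions; the abstract does not say how the paper partitions its data.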