SPEAKER VGG CCT: Cross-corpus Speech Emotion Recognition with Speaker Embedding and Vision Transformers
A. Arezzo, S. Berretti

TL;DR
This paper introduces a novel speech emotion recognition method combining Compact Convolutional Transformers with speaker embeddings, enabling effective cross-corpus performance with limited data and real-time operation.
Contribution
The paper proposes a new SER approach using CCTs and speaker embeddings, improving cross-corpus recognition and reducing data requirements compared to existing methods.
Findings
Achieves comparable or superior results to state-of-the-art architectures.
Operates in real-time with promising cross-corpus performance.
Effective with limited training data in diverse datasets.
Abstract
In recent years, Speech Emotion Recognition (SER) has been investigated mainly by transforming the speech signal into spectrograms that are then classified using Convolutional Neural Networks pretrained on generic images and fine-tuned with spectrograms. In this paper, we start from this general idea and develop a new learning solution for SER, which is based on Compact Convolutional Transformers (CCTs) combined with a speaker embedding. With CCTs, the learning power of Vision Transformers (ViT) is combined with a reduced need for large volumes of data, made possible by the convolution. This is important in SER, where large corpora of data are usually not available. The speaker embedding allows the network to extract an identity representation of the speaker, which is then integrated by means of a self-attention mechanism with the features that the CCT extracts from the…
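The abstract describes integrating a speaker identity representation with CCT features through self-attention, but this excerpt does not give the details. The following is a minimal NumPy sketch of one plausible way to do that, not the authors' implementation. It assumes the speaker embedding is linearly projected to the token width and prepended as an extra token before a single-head self-attention layer. The function name `fuse_with_self_attention`, all dimensions, and the random weights are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_with_self_attention(cct_tokens, speaker_emb, w_spk, w_q, w_k, w_v):
    """Hypothetical fusion of a speaker embedding with CCT tokens.

    cct_tokens:  (n_tokens, d) features from the convolutional tokenizer/encoder
    speaker_emb: (d_spk,) speaker identity vector
    w_spk:       (d_spk, d) projection of the speaker vector to token width
    w_q/w_k/w_v: (d, d) query, key and value projections
    """
    # Assumption: the speaker vector becomes one extra token in the sequence.
    spk_token = speaker_emb @ w_spk                      # (d,)
    seq = np.vstack([spk_token[None, :], cct_tokens])    # (n_tokens + 1, d)

    # Single-head scaled dot-product self-attention over the joint sequence.
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)                      # rows sum to 1
    fused = attn @ v                                     # (n_tokens + 1, d)
    return fused, attn


# Toy inputs standing in for real spectrogram tokens and a speaker embedding.
rng = np.random.default_rng(0)
n_tokens, d, d_spk = 16, 32, 64
cct_tokens = rng.normal(size=(n_tokens, d))
speaker_emb = rng.normal(size=(d_spk,))
w_spk = rng.normal(size=(d_spk, d)) / np.sqrt(d_spk)
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = fuse_with_self_attention(cct_tokens, speaker_emb, w_spk, w_q, w_k, w_v)

# Mean pooling is used here for simplicity; a linear layer plus softmax
# over emotion classes would follow in a full classifier.
pooled = fused.mean(axis=0)
```

In this sketch every spectrogram token can attend to the speaker token and vice versa, so speaker identity can modulate the emotion features. Other fusion schemes, such as cross-attention or concatenation after pooling, would fit the abstract's description equally well.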
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Softmax · Label Smoothing · Adam · Position-Wise Feed-Forward Layer · Dense Connections · Absolute Position Encodings
