Speaker Characterization by means of Attention Pooling
Federico Costa, Miquel India, Javier Hernando

TL;DR
This paper demonstrates that a Double Multi-Head Self-Attention pooling mechanism, originally used for speaker verification, can be effectively adapted to other speaker characterization tasks like emotion recognition, sex classification, and COVID-19 detection.
Contribution
The paper introduces the adaptation of a self-attention pooling architecture to various speaker characterization tasks, expanding its applicability beyond speaker verification.
Findings
The adapted architecture achieves excellent experimental results in emotion recognition, sex classification, and COVID-19 detection.
Self-attention pooling effectively captures the relevant speech features across these tasks.
The architecture improves the selection of relevant features compared to traditional pooling methods.
Abstract
State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has been shown to be an excellent approach to efficiently select the most relevant features captured by the front-end from the speech signal. In this paper we show excellent experimental results by adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.
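The pooling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one plausible reading of "double multi-head self-attention": the front-end output frames are split into heads, each head is pooled over time with its own query vector, and a second attention layer then pools across the heads. The function names, the scaled dot-product scoring, and the random values standing in for trainable queries are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Pool a list of equal-length vectors into one vector.

    Weights are softmax(dot(v, query) / sqrt(d)); the scaling is an assumption.
    Returns the weighted average and the attention weights.
    """
    d = len(query)
    scores = [sum(a * b for a, b in zip(v, query)) / math.sqrt(d) for v in vectors]
    weights = softmax(scores)
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
    return pooled, weights


def double_mhsa_pool(frames, head_queries, utterance_query):
    """Hypothetical double attention pooling: over time per head, then over heads."""
    num_heads = len(head_queries)
    head_dim = len(frames[0]) // num_heads
    head_contexts = []
    for h, query in enumerate(head_queries):
        # Slice out this head's sub-vector from every frame.
        sub_frames = [f[h * head_dim:(h + 1) * head_dim] for f in frames]
        context, _ = attention_pool(sub_frames, query)
        head_contexts.append(context)
    # Second attention layer: weight the head context vectors against each other.
    return attention_pool(head_contexts, utterance_query)


# Usage: utterances of different lengths map to vectors of the same size.
random.seed(0)
K, HEAD_DIM = 4, 8  # illustrative sizes, not taken from the paper
head_queries = [[random.gauss(0, 1) for _ in range(HEAD_DIM)] for _ in range(K)]
utterance_query = [random.gauss(0, 1) for _ in range(HEAD_DIM)]
for num_frames in (50, 120):
    frames = [[random.gauss(0, 1) for _ in range(K * HEAD_DIM)] for _ in range(num_frames)]
    pooled, head_weights = double_mhsa_pool(frames, head_queries, utterance_query)
    print(num_frames, len(pooled))
```

In a trained system the queries would be learned parameters and the pooled vector would feed the fully connected layers. The sketch only shows how a variable number of frames is reduced to a fixed-length representation.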
Taxonomy
Methods: Sparse Evolutionary Training
