Fine-grained Early Frequency Attention for Deep Speaker Representation Learning
Amirhossein Hajavi, Ali Etemad

TL;DR
This paper introduces a lightweight, fine-grained frequency attention mechanism that enhances deep speaker representation learning across multiple speech tasks, improving accuracy and robustness against noise.
Contribution
The paper proposes FEFA, a novel attention module focusing on individual frequency-bins, which can be integrated into CNNs to improve speaker and speech emotion recognition performance.
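As a rough illustration of attending to individual frequency bins, the sketch below gates each bin of a spectrogram with its own weight before the result would be passed to a CNN. This is not the authors' architecture, whose internals this summary does not describe. The function name `frequency_bin_attention`, the per-bin parameters `w` and `b`, the time-averaged summary, and the sigmoid gating are all assumptions made for illustration only.

```python
import numpy as np

def frequency_bin_attention(spec, w, b):
    """Scale every frequency bin of a spectrogram by its own attention weight.

    spec : array of shape (n_freq, n_frames), e.g. a magnitude spectrogram
    w, b : arrays of shape (n_freq,), hypothetical per-bin parameters that
           would be learned in a real model (random here)
    """
    # Summarize each frequency bin by its mean over time (an assumption).
    bin_summary = spec.mean(axis=1)
    # One sigmoid weight per bin, so each bin is attended to individually.
    weights = 1.0 / (1.0 + np.exp(-(w * bin_summary + b)))
    # Broadcast the per-bin weights across all time frames.
    return spec * weights[:, None], weights

rng = np.random.default_rng(0)
spec = rng.random((257, 100))      # 257 frequency bins, 100 frames
w = rng.standard_normal(257)       # hypothetical parameters
b = np.zeros(257)

attended, weights = frequency_bin_attention(spec, w, b)
print(attended.shape, weights.shape)
```

The attended spectrogram keeps the input shape, so under these assumptions a module of this kind could sit in front of an existing CNN without changing the rest of the pipeline.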
Findings
FEFA improves performance across speaker and emotion recognition tasks.
Models with FEFA outperform existing attention models.
FEFA enhances robustness against noise in speech data.
Abstract
Deep learning techniques have considerably improved speech processing in recent years. Speaker representations extracted by deep learning models are being used in a wide range of tasks such as speaker recognition and speech emotion recognition. Attention mechanisms have started to play an important role in improving deep learning models in the field of speech processing. Nonetheless, despite the fact that important speaker-related information can be embedded in individual frequency-bins of the input spectral representations, current attention models are unable to attend to fine-grained information items in spectral representations. In this paper we propose Fine-grained Early Frequency Attention (FEFA) for speaker representation learning. Our model is a simple and lightweight model that can be integrated into various CNN pipelines and is capable of focusing on information items as small…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
