Frequency and temporal convolutional attention for text-independent speaker recognition
Sarthak Yadav, Atul Rai

TL;DR
This paper introduces convolutional attention modules for CNN-based speaker recognition, modeling frequency and temporal information separately, leading to state-of-the-art results on VoxCeleb with improved robustness in real-world conditions.
Contribution
It proposes convolutional attention methods for frequency and temporal modeling in CNNs, enhancing speaker recognition performance over existing baselines.
Findings
Achieves a 2.031% equal error rate (EER) on the VoxCeleb1 test set, setting a new state of the art.
Convolutional attention modules outperform no-attention and spatial-CBAM baselines.
Simultaneous modeling of frequency and temporal attention improves real-world robustness.
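For readers unfamiliar with the metric, the EER reported above is the operating point at which the false-acceptance rate equals the false-rejection rate. The snippet below is a minimal sketch of how it could be estimated from verification trial scores. The function name `equal_error_rate` and the simple threshold sweep are illustrative assumptions, not the evaluation code used by the authors.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the EER by sweeping a decision threshold over the observed scores.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1/True for target (same-speaker) trials, 0/False for impostor trials.
    Hypothetical helper for illustration only.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = np.unique(scores)
    thresholds = np.concatenate([thresholds, [thresholds[-1] + 1.0]])
    # False-acceptance rate: impostor trials scored at or above the threshold.
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    # False-rejection rate: target trials scored below the threshold.
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])
    # Take the threshold where the two rates are closest and average them.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2.0

# Toy example with four trials (two target, two impostor).
eer = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

With a finite trial list the two rates rarely cross exactly, so averaging them at the closest threshold is one common approximation; interpolating between neighbouring thresholds is another.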
Abstract
The majority of recent approaches for text-independent speaker recognition apply attention or similar techniques for aggregation of frame-level feature descriptors generated by a deep neural network (DNN) front-end. In this paper, we propose methods of convolutional attention for independently modelling temporal and frequency information in a convolutional neural network (CNN) based front-end. Our system utilizes convolutional block attention modules (CBAMs) [1] appropriately modified to accommodate spectrogram inputs. The proposed CNN front-end fitted with the proposed convolutional attention modules outperforms the no-attention and spatial-CBAM baselines by a significant margin on the VoxCeleb [2, 3] speaker verification benchmark, and our best model achieves an equal error rate of 2.031% on the VoxCeleb1 test set, improving the existing state-of-the-art result by a significant margin.…
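The idea of attending to frequency and time separately can be illustrated with a small NumPy sketch. It follows the general recipe of CBAM spatial attention (average- and max-pooled descriptors, a convolution, a sigmoid gate) but restricts the gate to a single axis of a (channels, frequency, time) feature map. The pooling choices, kernel size, function names, and the order in which the two gates are applied are all assumptions made for illustration; the abstract does not specify the authors' exact module design.

```python
import numpy as np

def axis_attention(x, kernel, axis):
    """Gate a (C, F, T) feature map along one axis (1 = frequency, 2 = time).

    kernel: array of shape (2, k) with odd k, standing in for the learned
            1-D convolution over the [avg, max] descriptors (hypothetical).
    Returns the gated feature map and the attention weights.
    """
    # Pool over every axis except the one being attended to.
    other = tuple(a for a in range(x.ndim) if a != axis)
    avg = x.mean(axis=other)
    mx = x.max(axis=other)
    desc = np.stack([avg, mx])                      # shape (2, N)
    k = kernel.shape[1]
    pad = k // 2
    desc = np.pad(desc, ((0, 0), (pad, pad)))       # zero "same" padding
    n = x.shape[axis]
    # 1-D convolution (cross-correlation) across the attended axis.
    logits = np.array([np.sum(desc[:, i:i + k] * kernel) for i in range(n)])
    weights = 1.0 / (1.0 + np.exp(-logits))         # sigmoid gate in (0, 1)
    shape = [1] * x.ndim
    shape[axis] = n
    return x * weights.reshape(shape), weights

def frequency_temporal_attention(x, freq_kernel, time_kernel):
    """Apply the frequency gate, then the temporal gate (ordering is an assumption)."""
    x, w_f = axis_attention(x, freq_kernel, axis=1)
    x, w_t = axis_attention(x, time_kernel, axis=2)
    return x, w_f, w_t

# Toy usage: 4 channels, 8 frequency bins, 10 frames, random stand-in kernels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 8, 10))
out, w_f, w_t = frequency_temporal_attention(
    feats, rng.standard_normal((2, 3)), rng.standard_normal((2, 3))
)
```

In a trained network the two kernels would be learned parameters inside each convolutional block; here they are random so the example stays self-contained.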
