Studying squeeze-and-excitation used in CNN for speaker verification
Mickael Rouvier, Pierre-Michel Bousquet

TL;DR
This paper investigates the impact of squeeze-and-excitation mechanisms in CNNs for speaker verification, proposing a new pooling method that enhances speaker discrimination across various datasets.
Contribution
It provides a qualitative analysis of SE in speaker verification and introduces a novel pooling variant that improves performance in ResNet architectures.
Findings
Applying SE in early ResNet stages improves speaker feature extraction.
The proposed pooling method enhances verification accuracy on the VoxCeleb and SITW datasets.
SE mechanisms with the new pooling yield significant discrimination gains.
Abstract
In speaker verification, the extraction of voice representations is mainly based on the Residual Neural Network (ResNet) architecture. ResNet is built upon convolution layers, which learn filters to capture local spatial patterns across the whole input and then generate feature maps that jointly encode the spatial and channel information. Unfortunately, all feature maps in a convolution layer are learnt independently (the convolution layer does not exploit the dependencies between feature maps) and locally. This problem was first tackled in image processing. A channel attention mechanism, called squeeze-and-excitation (SE), has recently been proposed for convolution layers and applied to speaker verification. This mechanism re-weights the information extracted across feature maps. In this paper, we first propose an original qualitative study about the influence and the role of the SE…
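The channel re-weighting described in the abstract can be illustrated with a minimal sketch of a generic SE block, assuming the standard formulation from image processing: global average pooling as the squeeze step, then a two-layer bottleneck with ReLU and sigmoid as the excitation step. This is not the pooling variant proposed in the paper, whose details are not given here. The function name `se_block`, the weight matrices `w1` and `w2`, the reduction ratio `r`, and the toy shapes are all hypothetical and chosen only for illustration.

```python
import math
import random


def se_block(feature_maps, w1, w2):
    """Re-weight the channels of `feature_maps` (a list of C maps, each a list of rows).

    w1: (C // r) x C weights of the bottleneck layer (hypothetical, no bias).
    w2: C x (C // r) weights of the expansion layer (hypothetical, no bias).
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in fmap) / sum(len(row) for row in fmap)
        for fmap in feature_maps
    ]
    # Excitation: bottleneck layer with ReLU, then expansion layer with sigmoid,
    # which yields one gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value of a channel by that channel's gate.
    scaled = [
        [[g * v for v in row] for row in fmap]
        for fmap, g in zip(feature_maps, gates)
    ]
    return scaled, gates


random.seed(0)
C, F, T, r = 4, 3, 5, 2  # channels, frequency bins, frames, reduction ratio (toy values)
x = [[[random.random() for _ in range(T)] for _ in range(F)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
y, gates = se_block(x, w1, w2)
```

Because each gate depends on the pooled statistics of every channel, the output of one feature map is influenced by all the others, which is the dependency that a plain convolution layer does not model. In a trained network `w1` and `w2` would be learnt, not drawn at random.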
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Residual Connection · Average Pooling · Kaiming Initialization · Global Average Pooling · Batch Normalization · 1x1 Convolution · Max Pooling · Residual Block · Bottleneck Residual Block
