TL;DR
This paper introduces a pooling method for speaker embeddings that uses pairwise correlations between channels at given frequencies as statistics, inspired by style transfer in computer vision, and demonstrates its effectiveness on VoxCeleb.
Contribution
It proposes a new pooling technique based on channel correlations for speaker embeddings, improving recognition performance.
Findings
Channel-wise correlation pooling outperforms average pooling.
The method enhances speaker recognition accuracy on VoxCeleb.
Correlation features effectively capture speaker characteristics.
Abstract
Speaker embeddings extracted with deep 2D convolutional neural networks are typically modeled as projections of first- and second-order statistics of channel-frequency pairs onto a linear layer, using either average or attentive pooling along the time axis. In this paper, we examine an alternative pooling method, where pairwise correlations between channels for given frequencies are used as statistics. The method is inspired by style-transfer methods in computer vision, where the style of an image, modeled by the matrix of channel-wise correlations, is transferred to another image in order to produce a new image having the style of the first and the content of the second. By drawing analogies between image style and speaker characteristics, and between image content and phonetic sequence, we explore the use of such channel-wise correlation features to train a ResNet architecture in an…
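As a rough illustration of the pooling idea described in the abstract, the following NumPy sketch computes, for each frequency bin, the correlation matrix between channels over time and keeps its upper triangle as a fixed-size statistic. The function name `channel_correlation_pooling`, the (channels, frequency, time) layout, and the `eps` stabiliser are assumptions made for this example. It is not the authors' implementation, which may differ, for instance in how frequencies are grouped or how the statistics are projected onto the embedding layer.

```python
import numpy as np


def channel_correlation_pooling(feats, eps=1e-8):
    """Pool a (C, F, T) feature map into per-frequency channel correlations.

    Hypothetical helper for illustration only.
    Returns an array of shape (F, C * (C - 1) // 2).
    """
    c, f, t = feats.shape
    # Centre every channel-frequency trajectory over the time axis.
    centered = feats - feats.mean(axis=2, keepdims=True)
    # Scale to unit standard deviation over time (eps avoids division by zero).
    std = np.sqrt((centered ** 2).mean(axis=2, keepdims=True))
    normed = centered / (std + eps)
    # corr[f, i, j] = time-averaged product of channels i and j at frequency f.
    corr = np.einsum("ift,jft->fij", normed, normed) / t
    # The matrix is symmetric with a unit diagonal, so keep only i < j.
    rows, cols = np.triu_indices(c, k=1)
    return corr[:, rows, cols]


# Toy input: 4 channels, 3 frequency bins, 50 time frames.
rng = np.random.default_rng(0)
toy = rng.standard_normal((4, 3, 50))
pooled = channel_correlation_pooling(toy)
print(pooled.shape)  # (3, 6)
```

Because the statistics are averaged over time, the output size does not depend on utterance length, and reordering the frames leaves it unchanged, which loosely mirrors the style-versus-content analogy drawn in the abstract.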
Taxonomy
Methods: Residual Connection · 1x1 Convolution · Average Pooling · Residual Block · Batch Normalization · Bottleneck Residual Block · Max Pooling · Convolution · Kaiming Initialization
