Transport-Oriented Feature Aggregation for Speaker Embedding Learning
Yusheng Tian, Jingyu Li, Tan Lee

TL;DR
This paper introduces a transport-oriented feature aggregation method for speaker embedding learning, capturing the geometric structure of feature distributions to improve speaker verification performance.
Contribution
The paper proposes a transport-oriented pooling approach that encodes the geometry of the frame-level feature distribution, and extends it to a weighted-frame version that incorporates an attention mechanism.
Findings
Improves over statistics pooling (mean and variance) in speaker verification experiments.
Results are reported on the VoxCeleb dataset.
Extends the aggregation to a weighted-frame version so that an attention mechanism can weight individual frames.
Abstract
Pooling is needed to aggregate frame-level features into utterance-level representations for speaker modeling. Given the success of statistics-based pooling methods, we hypothesize that speaker characteristics are well represented in the statistical distribution over the pre-aggregation layer's output, and propose to use transport-oriented feature aggregation for deriving speaker embeddings. The aggregated representation encodes the geometric structure of the underlying feature distribution, which is expected to contain valuable speaker-specific information that may not be represented by the commonly used statistical measures like mean and variance. The original transport-oriented feature aggregation is also extended to a weighted-frame version to incorporate the attention mechanism. Experiments on speaker verification with the VoxCeleb dataset show improvement over statistics pooling…
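To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of the two kinds of aggregation it mentions. It is an illustration under stated assumptions, not the authors' formulation: `statistics_pooling` is the usual mean and standard deviation baseline, while `transport_pooling` is one plausible reading of "transport-oriented" aggregation, using an entropic optimal transport plan (Sinkhorn iterations) between the frames and a small reference set. The function names, the `reference` array, the cost normalisation, and the `eps` and `n_iters` values are all hypothetical choices made for this example. Passing `weights` stands in for the weighted-frame (attention) variant.

```python
import numpy as np

def statistics_pooling(frames, weights=None):
    """Baseline: (optionally weighted) mean and std over T frames of shape (T, D)."""
    T = frames.shape[0]
    w = np.full(T, 1.0 / T) if weights is None else weights / weights.sum()
    mean = w @ frames                       # (D,)
    var = w @ (frames - mean) ** 2          # (D,)
    return np.concatenate([mean, np.sqrt(var + 1e-8)])

def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT plan between frame masses a (T,) and reference masses b (K,)."""
    gibbs = np.exp(-cost / eps)             # (T, K) Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (gibbs.T @ u)
        u = a / (gibbs @ v)
    return u[:, None] * gibbs * v[None, :]  # rows sum to a

def transport_pooling(frames, reference, weights=None, eps=0.1):
    """Assumed sketch: move frame mass onto K reference points and
    concatenate the mass-weighted average of frames at each point."""
    T, K = frames.shape[0], reference.shape[0]
    a = np.full(T, 1.0 / T) if weights is None else weights / weights.sum()
    b = np.full(K, 1.0 / K)
    # Squared Euclidean cost, rescaled to [0, 1] for numerical stability.
    cost = ((frames[:, None, :] - reference[None, :, :]) ** 2).sum(-1)
    cost = cost / (cost.max() + 1e-12)
    plan = sinkhorn_plan(cost, a, b, eps)   # (T, K)
    pooled = (plan.T @ frames) / b[:, None] # (K, D)
    return pooled.reshape(-1)               # utterance-level vector of size K * D
```

In this sketch the statistics-pooling output has size 2D regardless of how the frames are arranged, whereas the transport-based output has size K·D and depends on how frame mass is distributed across the reference points, which is one way an embedding could reflect the shape of the feature distribution beyond its mean and variance. In a trained system the reference set and the frame weights would presumably be learned; here they are plain inputs.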
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
