Transport-Oriented Feature Aggregation for Speaker Embedding Learning

Yusheng Tian; Jingyu Li; Tan Lee

arXiv:2206.12857 · eess.AS · June 28, 2022

TL;DR

This paper introduces a transport-oriented feature aggregation method for speaker embedding learning, capturing the geometric structure of feature distributions to improve speaker verification performance.

Contribution

The paper proposes a transport-oriented pooling approach that encodes the geometry of the frame-level feature distribution, and extends it to a weighted-frame version that incorporates an attention mechanism.

Findings

01 Improves on statistics pooling for speaker verification.

02 Improvements are reported on the VoxCeleb dataset.

03 A weighted-frame extension incorporates an attention mechanism into the aggregation.

Abstract

Pooling is needed to aggregate frame-level features into utterance-level representations for speaker modeling. Given the success of statistics-based pooling methods, we hypothesize that speaker characteristics are well represented in the statistical distribution over the pre-aggregation layer's output, and propose to use transport-oriented feature aggregation for deriving speaker embeddings. The aggregated representation encodes the geometric structure of the underlying feature distribution, which is expected to contain valuable speaker-specific information that may not be represented by the commonly used statistical measures like mean and variance. The original transport-oriented feature aggregation is also extended to a weighted-frame version to incorporate the attention mechanism. Experiments on speaker verification with the VoxCeleb dataset show improvement over statistics pooling…
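To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares statistics pooling (mean and standard deviation over frames) with one possible form of transport-based aggregation, in which frames are softly assigned to a small set of reference points through an entropy-regularised transport plan computed with Sinkhorn iterations. The function names, the `references` array, the regularisation strength `eps`, and the way `frame_weights` stands in for attention weights are all assumptions made for illustration; the summary above does not specify these details.

```python
import numpy as np


def statistics_pooling(frames):
    """Baseline: concatenate per-dimension mean and standard deviation over frames."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])


def sinkhorn_plan(cost, row_mass, col_mass, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between two discrete distributions.

    Plain Sinkhorn iterations; the row marginals match `row_mass` after the
    final update, the column marginals match `col_mass` approximately.
    """
    kernel = np.exp(-cost / eps)
    u = np.ones_like(row_mass)
    for _ in range(n_iters):
        v = col_mass / (kernel.T @ u)
        u = row_mass / (kernel @ v)
    return u[:, None] * kernel * v[None, :]


def transport_pooling(frames, references, frame_weights=None, eps=0.1):
    """Aggregate T frames (T x D) against K hypothetical reference points (K x D).

    frame_weights: optional non-negative per-frame weights (for example from an
    attention module, an assumption here); uniform mass is used when omitted.
    Returns a vector of length K * D.
    """
    n_frames = frames.shape[0]
    n_refs = references.shape[0]
    if frame_weights is None:
        row_mass = np.full(n_frames, 1.0 / n_frames)
    else:
        row_mass = np.asarray(frame_weights, dtype=float)
        row_mass = row_mass / row_mass.sum()
    col_mass = np.full(n_refs, 1.0 / n_refs)

    # Squared Euclidean cost between every frame and every reference point,
    # rescaled to [0, 1] so that exp(-cost / eps) stays well conditioned.
    diff = frames[:, None, :] - references[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    cost = cost / max(cost.max(), 1e-12)

    plan = sinkhorn_plan(cost, row_mass, col_mass, eps=eps)

    # Each reference point collects the mass-weighted average of the frames
    # transported to it; the K averages are concatenated into one embedding.
    pooled = (plan.T @ frames) / col_mass[:, None]
    return pooled.reshape(-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(50, 8))      # 50 frames, 8-dim features
    references = rng.normal(size=(4, 8))   # 4 hypothetical reference points
    print(statistics_pooling(frames).shape)
    print(transport_pooling(frames, references).shape)
```

In this sketch the embedding keeps one slot per reference point, so it records where the frame mass sits in feature space rather than only its first two moments. Passing non-uniform `frame_weights` changes how much mass each frame contributes, which is one simple way a weighted-frame variant could be realised.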

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing