Y-Vector: Multiscale Waveform Encoder for Speaker Embedding

Ge Zhu; Fei Jiang; Zhiyao Duan

arXiv:2010.12951 · eess.AS · June 10, 2021 · 1 citation

TL;DR

This paper introduces a multi-scale waveform encoder for speaker verification. It computes speech features with convolution branches at different time scales, aggregates them across multiple levels, and outperforms existing raw-waveform-based speaker embeddings.

Contribution

The paper proposes a multi-scale waveform encoder that combines three convolution branches at different time scales with squeeze-and-excitation blocks, a multi-level feature aggregator, and a time delay neural network (TDNN) to compute speaker embeddings from raw waveforms.

Findings

01 The proposed embeddings outperform existing raw-waveform-based speaker embeddings on speaker verification.

02 The multi-scale encoder attends to different frequency bands at its different scales.

03 The learned filters produce a flatter overall frequency response.
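Findings 02 and 03 concern the frequency response of the learned filters. The sketch below shows one common way to inspect such a response: evaluate the magnitude of each filter's discrete-time Fourier transform and sum the magnitudes across filters. The two-tap filters, the function names, and the min/max flatness score are assumptions made for illustration. They are not the learned filters or the analysis procedure from the paper.

```python
import cmath
import math


def magnitude_response(taps, n_bins=64):
    """Magnitude of an FIR filter's frequency response at n_bins points in [0, pi)."""
    response = []
    for k in range(n_bins):
        omega = math.pi * k / n_bins
        acc = sum(t * cmath.exp(-1j * omega * n) for n, t in enumerate(taps))
        response.append(abs(acc))
    return response


def cumulative_response(filters, n_bins=64):
    """Sum of the per-filter magnitude responses at each frequency bin."""
    total = [0.0] * n_bins
    for taps in filters:
        for k, magnitude in enumerate(magnitude_response(taps, n_bins)):
            total[k] += magnitude
    return total


def flatness(response):
    """Hypothetical flatness score: min/max ratio, where 1.0 is perfectly flat."""
    return min(response) / max(response)


# Toy filters (assumed values): a two-tap low-pass and a two-tap high-pass.
low_pass = [0.5, 0.5]
high_pass = [0.5, -0.5]

single = cumulative_response([low_pass])
combined = cumulative_response([low_pass, high_pass])

print(f"low-pass only: {flatness(single):.3f}")
print(f"low-pass + high-pass: {flatness(combined):.3f}")
```

In this toy case the two filters cover complementary bands, so their summed response scores closer to flat than either filter alone.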

Abstract

State-of-the-art text-independent speaker verification systems typically use cepstral features or filter bank energies as speech features. Recent studies have attempted to extract speaker embeddings directly from raw waveforms and have shown competitive results. In this paper, we propose a novel multi-scale waveform encoder that uses three convolution branches with different time scales to compute speech features from the waveform. These features are then processed by squeeze-and-excitation blocks, a multi-level feature aggregator, and a time delay neural network (TDNN) to compute the speaker embedding. We show that the proposed embeddings outperform existing raw-waveform-based speaker embeddings on speaker verification by a large margin. A further analysis of the learned filters shows that the multi-scale encoder attends to different frequency bands at its different scales while resulting in…
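The first two stages described in the abstract can be sketched in a few lines of plain Python: parallel convolution branches with different kernel widths, followed by a squeeze-and-excitation gate over the stacked channels. Everything specific here is an assumption for illustration: the kernel widths (4, 8, 16), the shared stride, the averaging kernels, the cropping used to align frames, and the gate weights `w1` and `w2`. The paper's actual layer sizes, its multi-level aggregator, and its TDNN are not reproduced.

```python
import math


def conv1d(signal, kernel, stride):
    """Valid 1-D correlation of `signal` with `kernel`, hopping by `stride`."""
    width = len(kernel)
    return [
        sum(signal[start + i] * kernel[i] for i in range(width))
        for start in range(0, len(signal) - width + 1, stride)
    ]


def multiscale_features(signal, kernels, stride):
    """Run one branch per kernel and stack the outputs as channels.

    Wider kernels yield fewer frames, so every branch is cropped to the
    shortest one. The result is indexed as [channel][frame].
    """
    branches = [conv1d(signal, kernel, stride) for kernel in kernels]
    n_frames = min(len(branch) for branch in branches)
    return [branch[:n_frames] for branch in branches]


def squeeze_excite(channels, w1, w2):
    """Squeeze-and-excitation: average over time, bottleneck MLP, sigmoid gates."""
    squeezed = [sum(c) / len(c) for c in channels]
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    return [[gate * x for x in c] for gate, c in zip(gates, channels)]


# Toy waveform: 64 samples of a sine with a 16-sample period.
signal = [math.sin(2 * math.pi * n / 16) for n in range(64)]

# Three time scales (assumed widths); each kernel is a simple moving average.
kernels = [[1.0 / width] * width for width in (4, 8, 16)]
features = multiscale_features(signal, kernels, stride=4)

# Hypothetical fixed gate weights: 3 channels -> 2 hidden units -> 3 gates.
w1 = [[0.5, -0.2, 0.1], [0.3, 0.4, -0.6]]
w2 = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8]]
scaled = squeeze_excite(features, w1, w2)

print(len(scaled), "channels x", len(scaled[0]), "frames")
```

Sharing one stride across branches keeps the frame rates equal, so the branches can be stacked after a simple crop. A trained encoder would learn the kernels and gate weights instead of fixing them.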

Code & Models

Repositories

gzhu06/Y-vector (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Convolution