Mixture factorized auto-encoder for unsupervised hierarchical deep factorization of speech signal

Zhiyuan Peng; Siyuan Feng; Tan Lee

arXiv:1911.01806 · eess.AS · November 6, 2019 · 1 cites

Open Access

TL;DR

This paper introduces a novel unsupervised deep auto-encoder that factorizes speech into linguistic and speaker factors using discrete and continuous representations, improving speaker verification and subword modeling.

Contribution

The proposed mixture factorized auto-encoder (mFAE) uniquely combines discrete and continuous representations for speech factorization without supervision.

Findings

01 Utterance embedder extracts speaker-discriminative embeddings comparable to baselines.

02 Frame tokenizer captures linguistic content effectively.

03 Model performs well on speaker verification and subword modeling tasks.

Abstract

Speech signal is constituted and contributed by various informative factors, such as linguistic content and speaker characteristic. There have been notable recent studies attempting to factorize speech signal into these individual factors without requiring any annotation. These studies typically assume continuous representation for linguistic content, which is not in accordance with general linguistic knowledge and may make the extraction of speaker information less successful. This paper proposes the mixture factorized auto-encoder (mFAE) for unsupervised deep factorization. The encoder part of mFAE comprises a frame tokenizer and an utterance embedder. The frame tokenizer models linguistic content of input speech with a discrete categorical distribution. It performs frame clustering by assigning each frame a soft mixture label. The utterance embedder generates an utterance-level…
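
To make the two encoder outputs described in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It only illustrates what "assigning each frame a soft mixture label" can look like (a categorical posterior over K components per frame) next to a single utterance-level vector. The names `frame_tokenizer`, `utterance_embedder`, and `weights`, the linear projection, and the mean pooling over time are all assumptions made for illustration; the abstract does not specify the network architecture or how the utterance-level vector is computed.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def frame_tokenizer(frames, weights):
    """Assign each frame a soft mixture label (a distribution over K components).

    frames:  list of T feature vectors, each of length D
    weights: hypothetical K x D projection standing in for a trained network
    """
    posteriors = []
    for frame in frames:
        # One logit per mixture component (assumed linear scoring).
        logits = [sum(w * x for w, x in zip(row, frame)) for row in weights]
        posteriors.append(softmax(logits))
    return posteriors

def utterance_embedder(frames):
    """Stand-in utterance-level vector: mean of the frames over time.

    Mean pooling is an assumption, not taken from the paper.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / num_frames for d in range(dim)]

# Toy input: T frames of D-dimensional features, K mixture components.
random.seed(0)
T, D, K = 5, 4, 3
frames = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
weights = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]

posteriors = frame_tokenizer(frames, weights)   # T soft labels, each sums to 1
embedding = utterance_embedder(frames)          # one vector for the whole utterance

# A hard cluster index per frame, if a discrete token is needed.
hard_labels = [max(range(K), key=lambda k: p[k]) for p in posteriors]
```

Each entry of `posteriors` is a probability vector of length K, so the frame-level output is discrete in the sense of a categorical distribution, while `embedding` is a single continuous vector shared by all frames of the utterance.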

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing