Mixture factorized auto-encoder for unsupervised hierarchical deep factorization of speech signal
Zhiyuan Peng, Siyuan Feng, Tan Lee

TL;DR
This paper introduces a novel unsupervised deep auto-encoder that factorizes speech into linguistic and speaker factors using discrete and continuous representations, improving speaker verification and subword modeling.
Contribution
The proposed mixture factorized auto-encoder (mFAE) uniquely combines discrete and continuous representations for speech factorization without supervision.
Findings
The utterance embedder extracts speaker-discriminative embeddings that perform comparably to baselines.
The frame tokenizer captures linguistic content effectively.
The model performs well on speaker verification and subword modeling tasks.
Abstract
A speech signal carries several informative factors, such as linguistic content and speaker characteristics. Notable recent studies have attempted to factorize the speech signal into these individual factors without requiring any annotation. These studies typically assume a continuous representation for linguistic content, which is not in accordance with general linguistic knowledge and may make the extraction of speaker information less successful. This paper proposes the mixture factorized auto-encoder (mFAE) for unsupervised deep factorization. The encoder part of mFAE comprises a frame tokenizer and an utterance embedder. The frame tokenizer models the linguistic content of input speech with a discrete categorical distribution. It performs frame clustering by assigning each frame a soft mixture label. The utterance embedder generates an utterance-level…
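The two encoder outputs described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear projections, the mean pooling over frames, the dimensions, and the names `frame_tokenizer`, `utterance_embedder`, `w_tok`, and `w_emb` are all assumptions chosen for illustration. The sketch only shows the shape of the idea: each frame receives a soft mixture label (a categorical distribution over K clusters), while the whole utterance receives a single continuous vector.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax: turns K scores into a categorical distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(weights, vec):
    # Multiply a (rows x D) weight matrix by a length-D feature vector.
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def frame_tokenizer(frames, weights):
    # One soft mixture label per frame (hypothetical linear layer + softmax).
    return [softmax(matvec(weights, f)) for f in frames]

def utterance_embedder(frames, weights):
    # One utterance-level vector. Mean pooling over frames is an assumption here.
    projected = [matvec(weights, f) for f in frames]
    n = len(projected)
    return [sum(col) / n for col in zip(*projected)]

# Toy sizes (all hypothetical): T frames, D features, K clusters, E embedding dims.
rng = random.Random(0)
T, D, K, E = 5, 4, 3, 2
frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
w_tok = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
w_emb = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(E)]

labels = frame_tokenizer(frames, w_tok)        # T distributions over K clusters
embedding = utterance_embedder(frames, w_emb)  # one vector of length E
print(len(labels), len(labels[0]), len(embedding))  # prints: 5 3 2
```

Each row of `labels` sums to 1, which is what makes it a soft assignment rather than a hard cluster index. The embedding stays fixed for the whole utterance, matching the intuition that speaker identity does not change from frame to frame.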
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
