Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations
Weiwei Lin, Chenhang He, Man-Wai Mak, Youzhi Tu

TL;DR
This paper introduces a novel self-supervised factor analysis approach that disentangles utterance-level speech representations, enabling improved performance on speaker, emotion, and language recognition tasks with limited labeled data.
Contribution
The proposed factor analysis (FA) model uses hidden acoustic units for utterance-level learning, improving self-supervised learning (SSL) speech models on non-semantic tasks without extensive supervision.
Findings
Outperforms WavLM on the utterance-level, non-semantic tasks of the SUPERB benchmark
Reaches this result with only 20% of the labeled data
Effectively disentangles speech content from speaker/emotion/language features
Abstract
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and proved to be extremely useful in low label-resource settings. However, the success of SSL models has yet to transfer to utterance-level tasks such as speaker, emotion, and language recognition, which still require supervised fine-tuning of the SSL models to obtain good performance. We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective for these tasks. Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. The underlying utterance-level representations are disentangled from the content of speech using probabilistic inference on…
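To make the idea of aligning frame features by acoustic unit and then inferring one utterance-level vector more concrete, here is a minimal sketch of a classical i-vector-style factor analysis in NumPy. It is not the authors' implementation. It assumes hard nearest-centroid unit assignments, an identity noise covariance, and a standard normal prior on the latent vector. The names `assign_units`, `utterance_posterior_mean`, `centroids`, `means`, and `loadings` are hypothetical and chosen only for illustration.

```python
import numpy as np


def assign_units(frames, centroids):
    """Hard-assign each frame to its nearest centroid (a stand-in for a hidden acoustic unit)."""
    # Squared Euclidean distance between every frame and every centroid: shape (T, K).
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def utterance_posterior_mean(frames, units, means, loadings):
    """Posterior mean of the utterance-level latent w.

    Assumed model (illustrative only): a frame x_t assigned to unit k is
        x_t = means[k] + loadings[k] @ w + noise,
    with noise ~ N(0, I) and prior w ~ N(0, I).
    """
    num_units, _, rank = loadings.shape
    precision = np.eye(rank)   # prior precision of w
    rhs = np.zeros(rank)
    for k in range(num_units):
        x_k = frames[units == k]
        if len(x_k) == 0:
            continue  # a unit that never occurs contributes no evidence
        t_k = loadings[k]
        # Zeroth-order statistic (frame count) scales the precision update.
        precision += len(x_k) * (t_k.T @ t_k)
        # Centered first-order statistic projected through the loading matrix.
        rhs += t_k.T @ (x_k - means[k]).sum(axis=0)
    return np.linalg.solve(precision, rhs)


# Toy usage with random stand-ins for SSL frame features.
rng = np.random.default_rng(0)
demo_frames = rng.normal(size=(200, 8))       # 200 frames, 8-dim features
demo_centroids = rng.normal(size=(4, 8))      # 4 hypothetical acoustic units
demo_loadings = rng.normal(size=(4, 8, 3))    # rank-3 latent space
demo_units = assign_units(demo_frames, demo_centroids)
demo_w = utterance_posterior_mean(demo_frames, demo_units, demo_centroids, demo_loadings)
print(demo_w.shape)
```

Grouping frames by unit before accumulating statistics is what removes most of the content variation in this sketch: each unit has its own mean and loading matrix, so the shared latent `w` only has to explain what is common across the whole utterance.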
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Adam · Absolute Position Encodings · Softmax · Layer Normalization · Byte Pair Encoding
