Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

Weiwei Lin; Chenhang He; Man-Wai Mak; Youzhi Tu

arXiv:2305.08099 · cs.SD · October 5, 2023 · 2 cites

TL;DR

This paper introduces a novel self-supervised factor analysis approach that disentangles utterance-level speech representations, enabling improved performance on speaker, emotion, and language recognition tasks with limited labeled data.

Contribution

The proposed FA-based model uses hidden acoustic units for utterance-level learning, enhancing SSL speech models for non-semantic tasks without extensive supervision.

Findings

01 Outperforms WavLM on SUPERB benchmark tasks

02 Achieves high accuracy with only 20% labeled data

03 Effectively disentangles speech content from speaker/emotion/language features

Abstract

Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and proved to be extremely useful in low label-resource settings. However, the success of SSL models has yet to transfer to utterance-level tasks such as speaker, emotion, and language recognition, which still require supervised fine-tuning of the SSL models to obtain good performance. We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective for these tasks. Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. The underlying utterance-level representations are disentangled from the content of speech using probabilistic inference on…
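
The inference step described in the abstract can be illustrated with a classical linear-Gaussian factor analysis model, in the style of i-vector extraction. This is a minimal sketch under stated assumptions, not the authors' neural implementation: hard cluster labels stand in for the hidden acoustic units, the noise covariance is assumed to be the identity, and the function name `utterance_posterior_mean` and the parameters `means` and `loadings` are hypothetical. Each frame is modelled as x_t = m_k + T_k w + noise, where k is the frame's acoustic unit and w is the utterance-level factor whose posterior mean serves as the utterance representation.

```python
import numpy as np


def utterance_posterior_mean(frames, labels, means, loadings):
    """Posterior mean of the utterance factor w (assumed model, see lead-in).

    Model: x_t = means[k] + loadings[k] @ w + eps, with eps ~ N(0, I)
    and prior w ~ N(0, I), where k = labels[t].

    frames:   (T, D) frame-level features
    labels:   (T,)   hard acoustic-unit index per frame, in [0, K)
    means:    (K, D) per-unit mean vectors
    loadings: (K, D, R) per-unit factor loading matrices
    """
    rank = loadings.shape[2]
    precision = np.eye(rank)   # starts at the prior precision of w
    linear = np.zeros(rank)
    for k in range(means.shape[0]):
        mask = labels == k
        n_k = int(mask.sum())  # zeroth-order statistic for unit k
        if n_k == 0:
            continue
        t_k = loadings[k]
        # Centered first-order statistic for unit k.
        f_k = (frames[mask] - means[k]).sum(axis=0)
        precision += n_k * (t_k.T @ t_k)
        linear += t_k.T @ f_k
    return np.linalg.solve(precision, linear)


# Synthetic check: draw frames from the assumed model and recover w.
rng = np.random.default_rng(0)
K, D, R, T = 4, 8, 2, 2000
means = rng.normal(size=(K, D))
loadings = rng.normal(size=(K, D, R))
w_true = rng.normal(size=R)
labels = rng.integers(0, K, size=T)
frames = means[labels] + loadings[labels] @ w_true + rng.normal(size=(T, D))
w_hat = utterance_posterior_mean(frames, labels, means, loadings)
```

Because the content-dependent means and loadings are indexed by acoustic unit, the recovered `w_hat` captures what is shared across all frames of the utterance, which is the sense in which the utterance-level factor is separated from content in this simplified setting.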

Videos

Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations · slideslive

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Adam · Absolute Position Encodings · Softmax · Layer Normalization · Byte Pair Encoding