Towards the Next Frontier in Speech Representation Learning Using Disentanglement

Varun Krishna; Sriram Ganapathy

arXiv:2407.02543 · cs.CL · July 22, 2025

TL;DR

This paper introduces Learn2Diss, a novel framework for disentangling speaker and phonemic information in speech representations using joint frame-level and utterance-level encoders, improving various downstream tasks.

Contribution

It proposes a new disentanglement framework combining frame and utterance-level encoders with mutual information criteria for speech representation learning.
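The difference between the two encoder granularities can be illustrated with a toy sketch. This is not the Learn2Diss model: the functions `frame_level_encode` and `utterance_level_encode` are hypothetical stand-ins, mean pooling is an assumption chosen for simplicity, and the mutual information criterion is left out because this summary does not specify its form. The sketch only shows that a frame-level encoder emits one vector per frame, while an utterance-level encoder emits a single vector for the whole utterance.

```python
from statistics import fmean

def frame_level_encode(frames):
    """Toy stand-in (assumption): one output vector per input frame.

    Subtracts the per-utterance mean so that only the time-varying
    part of each frame remains.
    """
    dims = len(frames[0])
    means = [fmean(f[d] for f in frames) for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]

def utterance_level_encode(frames):
    """Toy stand-in (assumption): one vector per utterance (mean over time)."""
    dims = len(frames[0])
    return [fmean(f[d] for f in frames) for d in range(dims)]

# Hypothetical utterance: 3 frames, 2 features per frame.
utterance = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

frame_repr = frame_level_encode(utterance)    # 3 vectors, one per frame
utt_repr = utterance_level_encode(utterance)  # 1 vector for the utterance

print(len(frame_repr), len(utt_repr))
```

In this toy, whatever stays constant across the utterance ends up in `utt_repr`, and what varies over time stays in `frame_repr`; a learned system would replace both functions with trained networks.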

Findings

01. Achieves state-of-the-art results on multiple speech tasks.

02. Frame-level representations enhance semantic understanding.

03. Utterance-level representations improve non-semantic tasks.

Abstract

The popular frameworks for self-supervised learning of speech representations have largely focused on frame-level masked prediction of speech regions. While this has shown promising downstream performance for speech recognition and related tasks, it has largely ignored factors of speech that are encoded at a coarser level, such as characteristics of the speaker or channel that remain consistent throughout a speech utterance. In this work, we propose a framework for Learning Disentangled Self-Supervised (termed Learn2Diss) representations of speech, which consists of a frame-level and an utterance-level encoder module. The two encoders are initially learned independently: the frame-level model is largely inspired by existing self-supervision techniques and thereby learns pseudo-phonemic representations, while the utterance-level encoder is inspired by contrastive learning of…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing