Disentangling Textual and Acoustic Features of Neural Speech Representations
Hosein Mohebbi, Grzegorz Chrupała, Willem Zuidema, Afra Alishahi, Ivan Titov

TL;DR
This paper introduces a disentanglement framework based on the Information Bottleneck principle to separate textual and acoustic features in neural speech models, aiding privacy and interpretability in speech processing tasks.
Contribution
It proposes a novel disentanglement method that isolates content and acoustic features in neural speech representations, enabling better analysis and privacy control.
Findings
Effective separation of textual and acoustic features demonstrated
Quantified feature contributions across model layers
Identified salient speech frames for downstream tasks
Abstract
Neural speech models build deeply entangled internal representations, which capture a variety of features (e.g., fundamental frequency, loudness, syntactic category, or semantic content of a word) in a distributed encoding. This complexity makes it difficult to track the extent to which such representations rely on textual and acoustic information, or to suppress the encoding of acoustic features that may pose privacy risks (e.g., gender or speaker identity) in critical, real-world applications. In this paper, we build upon the Information Bottleneck principle to propose a disentanglement framework that separates complex speech representations into two distinct components: one encoding content (i.e., what can be transcribed as text) and the other encoding acoustic features relevant to a given downstream task. We apply and evaluate our framework to emotion recognition and speaker…
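As a rough illustration of the Information Bottleneck idea the abstract builds on, the sketch below computes a generic variational bottleneck loss for a single example: a task term (cross-entropy) plus a compression term (the KL divergence from a diagonal Gaussian latent to a standard normal prior), weighted by beta. This is a minimal sketch of the general principle, not the paper's actual objective or architecture. The function names, the Gaussian latent, and the default `beta` are assumptions made for illustration.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def bottleneck_loss(mu, log_var, logits, target, beta=0.01):
    """Task loss plus beta-weighted compression penalty (hypothetical helper).

    A larger beta squeezes more information out of the latent; a smaller
    beta favours task accuracy. The value 0.01 is an arbitrary placeholder.
    """
    return cross_entropy(logits, target) + beta * kl_to_standard_normal(mu, log_var)
```

In a disentanglement setting of the kind the abstract describes, one could imagine training one such bottleneck against a transcription target and another against the downstream task label, but how the paper actually combines the two components is not specified in the text above.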
Taxonomy
Topics: Speech Recognition and Synthesis
