Exploring Disentangled Neural Speech Codecs from Self-Supervised Representations
Ryo Aihara, Yoshiki Masuyama, Gordon Wichern, François G. Germain, Jonathan Le Roux

TL;DR
This paper introduces a neural audio codec that explicitly disentangles linguistic and paralinguistic features using self-supervised representations, enabling flexible speech manipulation like voice conversion without sacrificing reconstruction quality.
Contribution
It presents a novel discrete neural audio codec that achieves structured disentanglement of speech features, improving flexibility for downstream tasks such as voice conversion.
Findings
The codec achieves reconstruction performance comparable to conventional NACs.
It effectively disentangles linguistic and paralinguistic features.
In voice conversion, it matches the effectiveness of conventional techniques.
Abstract
Neural audio codecs (NACs), which use neural networks to generate compact audio representations, have garnered interest for their applicability to many downstream tasks -- especially quantized codecs due to their compatibility with large language models. However, unlike text, speech conveys not only linguistic content but also rich paralinguistic features. Encoding these elements in an entangled fashion may be suboptimal, as it limits flexibility. For instance, voice conversion (VC) aims to convert speaker characteristics while preserving the original linguistic content, which requires a disentangled representation. Inspired by VC methods utilizing k-means quantization with self-supervised features to disentangle phonetic information, we develop a discrete NAC capable of structured disentanglement. Experimental evaluations show that our approach achieves reconstruction performance on…
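As a rough illustration of the k-means quantization idea the abstract refers to, the sketch below clusters frame-level features and splits each frame into a nearest-centroid part and a residual. It is not the authors' codec: the random array merely stands in for self-supervised features, the function names `kmeans` and `quantize` are hypothetical, and reading the centroid stream as "content" and the residual as "everything else" is an assumption made for illustration only.

```python
import numpy as np

def kmeans(features, k, n_iters=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct frames.
    centroids = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iters):
        assignments = quantize(features, centroids)
        for j in range(k):
            members = features[assignments == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, quantize(features, centroids)

def quantize(features, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per frame."""
    diffs = features[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=-1).argmin(axis=1)

# Assumption: random data standing in for self-supervised features
# (200 frames, 16 dimensions). Real features would come from a pretrained model.
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 16))

centroids, tokens = kmeans(frames, k=8)
quantized = centroids[tokens]   # discrete stream: one centroid per frame
residual = frames - quantized   # whatever the discrete stream does not capture
```

Here `tokens` is a sequence of integer cluster indices, which is the kind of discrete representation that pairs naturally with language-model-style downstream systems; how the paper itself structures and encodes the remaining information is not specified in the text above.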
Taxonomy
Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Neural Networks and Applications
