Exploring Disentangled Neural Speech Codecs from Self-Supervised Representations

Ryo Aihara; Yoshiki Masuyama; Gordon Wichern; François G. Germain; Jonathan Le Roux

arXiv:2508.08399·eess.AS·August 13, 2025

TL;DR

This paper introduces a neural audio codec that explicitly disentangles linguistic and paralinguistic features using self-supervised representations, enabling flexible speech manipulation like voice conversion without sacrificing reconstruction quality.

Contribution

The paper presents a discrete neural audio codec that achieves structured disentanglement of speech features, improving flexibility for downstream tasks such as voice conversion.

Findings

01. Achieves reconstruction performance comparable to traditional NACs.

02. Effectively disentangles linguistic and paralinguistic features.

03. Matches the effectiveness of conventional voice conversion techniques.

Abstract

Neural audio codecs (NACs), which use neural networks to generate compact audio representations, have garnered interest for their applicability to many downstream tasks -- especially quantized codecs due to their compatibility with large language models. However, unlike text, speech conveys not only linguistic content but also rich paralinguistic features. Encoding these elements in an entangled fashion may be suboptimal, as it limits flexibility. For instance, voice conversion (VC) aims to convert speaker characteristics while preserving the original linguistic content, which requires a disentangled representation. Inspired by VC methods utilizing $k$-means quantization with self-supervised features to disentangle phonetic information, we develop a discrete NAC capable of structured disentanglement. Experimental evaluations show that our approach achieves reconstruction performance on…
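The $k$-means quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: random vectors stand in for frame-level self-supervised features, the function names (`fit_kmeans`, `quantize`) are hypothetical, and the feature dimension and codebook size are arbitrary. Each frame is mapped to the ID of its nearest centroid, giving a discrete token sequence, and the residual is whatever the centroid does not capture.

```python
import random

def nearest(frame, centroids):
    """Index of the centroid closest to `frame` (squared Euclidean distance)."""
    dists = [sum((x - c) ** 2 for x, c in zip(frame, cen)) for cen in centroids]
    return dists.index(min(dists))

def fit_kmeans(frames, k, iters=10, seed=0):
    """Plain Lloyd's algorithm; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for f in frames:
            buckets[nearest(f, centroids)].append(f)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def quantize(frames, centroids):
    """Return per-frame token IDs and the residual left after quantization."""
    ids = [nearest(f, centroids) for f in frames]
    residuals = [
        [x - c for x, c in zip(f, centroids[i])] for f, i in zip(frames, ids)
    ]
    return ids, residuals

# Assumption: 50 frames of 4-dimensional random vectors stand in for
# self-supervised features; real features would come from a pretrained model.
data_rng = random.Random(1)
frames = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(50)]

centroids = fit_kmeans(frames, k=8)
ids, residuals = quantize(frames, centroids)
```

Here `ids` plays the role of the discrete token sequence, while `residuals` holds the per-frame information discarded by quantization; how a codec would encode that remainder is outside the scope of this sketch.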

Taxonomy

Topics: Speech Recognition and Synthesis · Generative Adversarial Networks and Image Synthesis · Neural Networks and Applications