Disentangled Feature Learning for Real-Time Neural Speech Coding
Xue Jiang, Xiulian Peng, Yuan Zhang, Yan Lu

TL;DR
This paper introduces a neural speech coding method that learns disentangled global and local features, improving coding efficiency and enabling real-time voice conversion with fewer parameters and lower latency.
Contribution
The paper proposes a disentangled feature learning approach for neural speech coding that improves coding efficiency and enables real-time audio editing, such as voice conversion.
Findings
Achieves better coding efficiency through feature disentanglement.
Enables real-time voice conversion with fewer parameters.
Demonstrates performance comparable to state-of-the-art models.
Abstract
Recently, end-to-end neural audio/speech coding has shown great potential to outperform traditional signal-analysis-based audio codecs. This is mostly achieved by following the VQ-VAE paradigm, where blind features are learned, vector-quantized, and coded. In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding. Specifically, more global-like speaker identity and local content features are learned with disentanglement to represent speech. Such a compact feature decomposition not only achieves better coding efficiency by exploiting bit allocation among different features but also provides the flexibility to do audio editing in embedding space, such as voice conversion in real-time communications. Both subjective and objective results demonstrate its coding efficiency and we find that the learned disentangled…
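To make the idea of coding a global feature and local features with separate bit budgets more concrete, here is a minimal NumPy sketch. It is not the authors' model: the random codebooks, the feature dimension, the frame rate, and the helper name `quantize` are all illustrative assumptions, and random vectors stand in for learned encoder outputs. It only shows nearest-neighbour vector quantization and the resulting bit accounting when one global index is sent once and one local index is sent per frame.

```python
import numpy as np

def quantize(vectors, codebook):
    """Nearest-neighbour vector quantization: one codebook index per input row."""
    # Squared Euclidean distance between every vector and every codeword.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)

# All sizes below are hypothetical, not values from the paper.
frames, dim = 50, 8
frame_rate_hz = 50                                # 50 frames -> 1 second of speech
local_codebook = rng.normal(size=(256, dim))      # 256 entries -> 8 bits per frame
global_codebook = rng.normal(size=(1024, dim))    # 1024 entries -> 10 bits, sent once

# Random stand-ins for learned features (a real codec would use encoder outputs).
local_feats = rng.normal(size=(frames, dim))      # per-frame content features
global_feat = rng.normal(size=(1, dim))           # one speaker-like feature

local_idx = quantize(local_feats, local_codebook)     # one index per frame
global_idx = quantize(global_feat, global_codebook)   # a single index

# Bit accounting: the global index is paid for once, the local index every frame.
bits_local = frames * int(np.log2(len(local_codebook)))
bits_global = int(np.log2(len(global_codebook)))
duration_s = frames / frame_rate_hz
bitrate_bps = (bits_local + bits_global) / duration_s
print(bits_local, bits_global, bitrate_bps)
```

Under this toy setup, the global index adds a fixed cost that does not grow with the number of frames, which is one way to read the abstract's point about bit allocation among features. Editing in embedding space could then be pictured as replacing `global_idx` with an index taken from another speaker while leaving `local_idx` unchanged, though how the paper's decoder actually performs voice conversion is not described in this summary.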
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: VQ-VAE
