PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders

Yu Pan; Xiang Zhang; Yuguang Yang; Jixun Yao; Yanni Hu; Jianhao Ye; Hongbin Zhou; Lei Ma; Jianjun Zhao

arXiv:2404.02702·cs.SD·November 22, 2024·2 cites

TL;DR

This paper introduces PSCodec, a series of neural speech codecs built on prompt encoders that achieve high-quality speech reconstruction at low bitrates.

Contribution

The paper presents three novel neural speech codecs leveraging prompt encoders, including a new disentanglement method and an attention network to improve low-bitrate speech quality.

Findings

01 All three codecs outperform state-of-the-art neural codecs in quality and speaker similarity.

02 PSCodec-DRL-ICT achieves high performance but requires extensive tuning.

03 PSCodec-CasAN offers a less labor-intensive alternative with comparable results.

Abstract

Neural speech codecs have recently emerged as a focal point in the fields of speech compression and generation. Despite this progress, achieving high-quality speech reconstruction under low-bitrate scenarios remains a significant challenge. In this paper, we propose PSCodec, a series of neural speech codecs based on prompt encoders, comprising PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN, which are capable of delivering high-performance speech reconstruction with low bandwidths. Specifically, we first introduce PSCodec-Base, which leverages a pretrained speaker verification model-based prompt encoder (VPP-Enc) and a learnable Mel-spectrogram-based prompt encoder (MelP-Enc) to effectively disentangle and integrate voiceprint and Mel-related features in utterances. To further enhance feature utilization efficiency, we propose PSCodec-DRL-ICT, incorporating a structural similarity…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques