PSCodec: A Series of High-Fidelity Low-bitrate Neural Speech Codecs Leveraging Prompt Encoders
Yu Pan, Xiang Zhang, Yuguang Yang, Jixun Yao, Yanni Hu, Jianhao Ye, Hongbin Zhou, Lei Ma, Jianjun Zhao

TL;DR
This paper introduces PSCodec, a series of neural speech codecs that use prompt encoders to achieve high-quality speech reconstruction at low bitrates.
Contribution
The paper presents three novel neural speech codecs leveraging prompt encoders, including a new disentanglement method and an attention network to improve low-bitrate speech quality.
Findings
All three codecs outperform state-of-the-art neural codecs in quality and speaker similarity.
PSCodec-DRL-ICT achieves high performance but requires extensive tuning.
PSCodec-CasAN offers a less labor-intensive alternative with comparable results.
Abstract
Neural speech codecs have recently emerged as a focal point in the fields of speech compression and generation. Despite this progress, achieving high-quality speech reconstruction under low-bitrate scenarios remains a significant challenge. In this paper, we propose PSCodec, a series of neural speech codecs based on prompt encoders, comprising PSCodec-Base, PSCodec-DRL-ICT, and PSCodec-CasAN, which are capable of delivering high-performance speech reconstruction with low bandwidths. Specifically, we first introduce PSCodec-Base, which leverages a pretrained speaker verification model-based prompt encoder (VPP-Enc) and a learnable Mel-spectrogram-based prompt encoder (MelP-Enc) to effectively disentangle and integrate voiceprint and Mel-related features in utterances. To further enhance feature utilization efficiency, we propose PSCodec-DRL-ICT, incorporating a structural similarity…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
